UNSUPERVISED BUILDING AND EXPLOITATION OF COMPOSITE DESCRIPTORS

Field of the Invention 

The present invention relates to sequences of symbols and, more particularly, to
unsupervised building and exploitation of composite descriptors.



Background of the Invention 

Sequences of symbols are useful in a number of areas. One such area is DNA.
DNA (deoxyribonucleic acid) may be described through a long sequence of symbols.
DNA is commonly described through the characters A, G, C, or T. These characters may
be thought of as the alphabet of DNA. Another area where sequences of symbols are
important is proteins. Proteins are sequences of amino acids, where each amino acid can
be described by a character or letter. The "alphabet" of amino acids comprises the
characters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. Sequences
of symbols are also important in encryption and coding. For example, computers
commonly store character data in numeric format. For instance, the word "the" could be
coded in the American Standard Code for Information Interchange (ASCII) format as
decimal symbols 116, 104, and 101. Encryption schemes change these numbers to
conceal the underlying information.
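
As a small illustration of the coding example above, the following Python sketch
converts a word to its decimal ASCII symbols and back. The function names are
illustrative only and do not come from the description.

```python
def to_ascii_codes(text):
    """Return the decimal ASCII code of each character in text."""
    return [ord(ch) for ch in text]


def from_ascii_codes(codes):
    """Rebuild the text from its decimal ASCII codes."""
    return "".join(chr(code) for code in codes)


codes = to_ascii_codes("the")
print(codes)  # [116, 104, 101]
```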

For amino acids, there are very large databases of knowledge that consist of
sequences of proteins. Similar proteins are usually grouped into "families." Family
members should have the same properties associated with them; once the properties of
one of the family members are known, it is assumed that the other family members will
have similar properties. Additionally, once the family is known, the family may be used
to determine which candidate proteins are members of the family. Therefore, there has
been tremendous research to determine how to best group proteins into families.

Generally, there are four different methods used to group proteins. One method
is to determine a pattern of symbols that all of the sequences share. This is called a single
descriptor approach, which looks for particular patterns of characters. The patterns are
series of expected amino acids, described by alphabetic characters. In the pattern, some
locations could be important and some locations might not be. An example pattern for a
single descriptor might require certain amino acids to be in one particular location, then
allow several "don't care" locations where any amino acid could reside, and then require
only a particular amino acid in a final location. The patterns are based on observations
that, in nature, specific amino acid positions seem to be preserved in a biased way. These
specific amino acid positions are "conserved" even though their neighbors can undergo
mutations. Thus, researchers used the concept of conservation to describe the members
of the family. A very large, well known database of the single descriptor type is the
Prosite database. There are about 1100 families in this database. To find the patterns
contained in each family, the proteins contained there were first aligned. Then, the most
conserved region of the family was located and the pattern (the single descriptor)
contained in all or most of the family members was determined. However, there could be
members of a family that did not share the single descriptor. This generates false
negatives, as true members of the family are not recognized as such.
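
A single descriptor of the kind described above can be sketched as a regular
expression. The pattern and the sequences below are hypothetical and are not taken
from Prosite; they only show how required positions, "don't care" positions, and a
false negative arise.

```python
import re

# Hypothetical descriptor: C, two "don't care" positions, K or R,
# three "don't care" positions, then H. Not a real Prosite entry.
DESCRIPTOR = re.compile(r"C.{2}[KR].{3}H")


def has_descriptor(sequence):
    """True if the sequence contains the single descriptor anywhere."""
    return DESCRIPTOR.search(sequence) is not None


family = ["MACAAKGGGHL", "TTCWWRPPPHQ"]   # both contain the descriptor
divergent_member = "MACAAKGGGYL"           # final H mutated to Y

hits = [has_descriptor(s) for s in family]
missed = not has_descriptor(divergent_member)  # would be a false negative
```

A family member whose conserved position has mutated, as in `divergent_member`, is
missed by the single descriptor even though it belongs to the family.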

An improvement on the single descriptor method is the composite descriptor
method. The composite descriptor method examines a candidate protein for several
alphabetic patterns, as opposed to only one pattern with the single descriptor method.
Again, this method generally requires aligning the proteins so that the multiple patterns,
i.e., the composite descriptor, properly align within their respective blocks.

The conceptual underpinnings are the same across all the methods that rely on
composite descriptors. Any differences have essentially to do with either the manner in
which multiple alignments are used to construct the descriptors or whether the descriptors
are explicitly (e.g., a "regular expression") or implicitly (e.g., a "profile") represented in
the composite description. Additional characteristics common to these approaches
include: (a) an iterative component; (b) the availability of a set of known (or alleged)
family members (a "training" set) that provides an initial "bootstrapping" stage; (c) the
computation of a multiple-sequence alignment involving members of the training set
(these alignments are typically verified manually or semi-automatically and can be used to
derive profiles that allow the generation of quality measures when evaluating the results);
(d) a range of quality control checks that are optionally applied on the generated results;
and (e) the need to study the collection under consideration in order to identify a
minimum set of components that will form the composite description.

There are several problems with these approaches. For instance, in step (c), it is
implicitly assumed that there is a multiple-sequence alignment involving all of the
members of the training set; the alignment may either be a global alignment of both
conserved and non-conserved regions, or a local alignment of the most conserved regions.
This requirement unnecessarily burdens these methods. Additionally, multiple alignment
programs usually work best when the parameters are optimized for the set of sequences
which are being considered.

Steps (d) and (e) presuppose the availability of biological information pertaining
to the set under consideration, and this biological information may not always be present.
As a matter of fact, step (e) results in the selection and use of features which are
conditional on each other. Although easy to describe, an additional assumption here is
that the identity, cardinality, and properties of these features are available and also agreed
upon ahead of time. For example, a statement such as "G protein-coupled receptors
(GPCRs) are proteins involved in signal transduction in eukaryotic organisms that consist
of seven transmembrane helices composed typically of hydrophobic amino acids"
represents a body of knowledge that has been used by researchers in the building of
composite descriptors for GPCRs. With the supervised approaches described above, a
detailed and frequently manual study of the collection under consideration is unavoidable.

In addition to descriptor approaches, there are also "windowing" approaches that
build descriptors for a family. In these methods, one or more windows are used instead of
character patterns. A single window method is called the PROFILE approach. All of the
sequences of each of the family members are aligned with respect to their best-conserved
region. Researchers then determined a probability distribution for locations in each
column of the implied window. For each such block, they determined a probability of
expecting an amino acid at some location within the window and thus built a "profile" of
expected probabilities for each of the columns of the window. The researchers would
slide this set of probabilities against an unknown protein. If this candidate protein
matched the expected probabilities, they included the protein as a member of the family.
This approach was more tolerant than the single descriptor approach. Subsequently,
researchers began to use profiles for multiple windows. There could be two, three, or four
windows where the members of the family could agree on content. Sometimes, a profile
was not built explicitly but rather was maintained as a collection of the instances across
the known or alleged family members of the conserved region under consideration.
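
The single-window profile idea can be sketched as follows. This is a simplified
illustration under stated assumptions: the window instances are already aligned and
of equal length, each column probability is a plain relative frequency, and the
score of a window is the product of its column probabilities. Practical profile
methods typically add refinements such as pseudocounts and log-odds scoring, which
are omitted here. All names and data are hypothetical.

```python
from collections import Counter


def build_profile(instances):
    """Per-column residue frequencies from equal-length aligned instances."""
    width = len(instances[0])
    profile = []
    for col in range(width):
        counts = Counter(inst[col] for inst in instances)
        profile.append({aa: n / len(instances) for aa, n in counts.items()})
    return profile


def best_window_score(profile, sequence):
    """Slide the profile along the sequence and return the best window score."""
    width = len(profile)
    best = 0.0
    for start in range(len(sequence) - width + 1):
        score = 1.0
        for col, aa in enumerate(sequence[start:start + width]):
            # Residues never seen in a column get probability 0.
            score *= profile[col].get(aa, 0.0)
        best = max(best, score)
    return best


profile = build_profile(["ACD", "ACE", "GCD", "ACD"])
score = best_window_score(profile, "TTACDTT")
```

A candidate would be accepted as a family member when its best score exceeds a
threshold chosen by the researcher.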

The windowing methods again rely on alignment of proteins, which can be
relatively complex and computationally lengthy. Typically, these windowing methods are
supervised and biological information pertaining to the family can facilitate the analysis.
With supervised approaches, a detailed and frequently manual study of the collection
under consideration is unavoidable.

Therefore, there exists a need to provide a way of determining and using family
members of sequences in an unsupervised manner, without knowledge of biological
information related to the family, and without aligning the sequences.

Summary of the Invention 

Generally, the present invention provides a way of determining in an
unsupervised manner additional members for a family that is defined initially through
exemplar sequences. The present invention is unsupervised in that it proceeds without
any information related to the exemplar sequences defining the family, without aligning
the exemplar sequences, without prior knowledge of any patterns in the exemplar
sequences, and without knowledge of the cardinality or characteristics of any features that
may be present in the exemplar sequences. The cardinality of a set is the number of items
in a set. For instance, the cardinality of the set of letters in the English alphabet is 26. In
one aspect of the invention, a method is used to take a set of unaligned sequences and
discover several or many patterns common to some or all of the sequences. These
patterns can then be used to determine if candidate sequences are members of the family.
In another aspect of the invention, a method is used to take a set of sequences and to
determine a set of maximal patterns common to a number of sequences. The maximal
patterns are determined without any previous knowledge about any properties or features
that may be present in the processed sequences.

A more complete understanding of the present invention, as well as further
features and advantages of the present invention, will be obtained by reference to the
following detailed description and drawings.



Brief Description of the Drawings

FIG. 1 is a schematic block diagram showing an architecture of a system
for unsupervised building and exploitation of composite descriptors in accordance with
an embodiment of the present invention;

FIG. 2 is a flow chart describing unsupervised building and exploitation of
composite descriptors employed by the system of FIG. 1;

FIG. 3 is a histogram of the scores for the sequences of RAND-SP when
processed by the composite descriptor for an 80-sequence G protein-coupled receptor
training set; and

FIG. 4 is a histogram of the scores for the sequences of RAND-SP when
processed by the composite descriptor for a 70-sequence helix-turn-helix training set.

Detailed Description of Preferred Embodiments

Generally, the present invention provides a way of determining in an
unsupervised manner additional members for a family that is defined initially through
exemplar sequences. The present invention is unsupervised in that it proceeds without
any information related to the exemplar sequences defining the family, without aligning
the sequences, without prior knowledge of any patterns in the exemplar sequences, and
without knowledge of the cardinality or characteristics of any features that may be present
in the exemplar sequences. The cardinality of a set is the number of items in a set. For
instance, the cardinality of the set of letters in the English alphabet is 26. In one aspect
of the invention, a method is used to take a set of unaligned sequences and discover
several or many patterns common to some or all of the sequences. These patterns can
then be used to determine if candidate sequences are members of the family. In another
aspect of the invention, a method is used to take a set of sequences and to determine a set
of maximal patterns common to a number of sequences. The maximal patterns are
determined without any previous knowledge about any properties or features that may be
present in the processed sequences.

As previously stated, the present invention provides a way of determining
family members in an unsupervised manner. By "unsupervised" it is meant that no
predetermined or a priori information is needed/known about the exemplar sequences or
is employed by the discovery process. Additionally, there is no need for user supervision
or intervention. For instance, the present invention does not require knowledge of
biological information related to the family, aligned sequences, knowledge of properties
of the exemplar sequences defining the family, and/or knowledge of the cardinality or
characteristics of features of the exemplar sequences. It is possible to exclude one or
more of these restrictions. For instance, the present invention could be used on a set
of aligned sequences. The present invention would still determine a composite descriptor
suitable for examining candidate sequences and either including these sequences in or
excluding them from the family. However, a great benefit of the present invention is that
it does not need aligned sequences or the knowledge of predetermined properties and
features that may be present in the exemplar sequences. Aligning sequences and
determining properties and features in the exemplar sequences that originally define the
family is time consuming, complex, and at times intractable. Instead, the present
invention can determine a composite descriptor without such time intensive efforts.

Concerning features and properties of a sequence of symbols, it is not easy
to define what a feature is. The definition of a feature is directly related to the
representation of the items that are studied, i.e., the way each of the objects processed by
the system under consideration is represented and stored in a computer. Such a
representation is in turn related to the way an object can appear in the context of the
sensor data, and is unavoidably application specific. For example, in the context of image
processing by a computer, the following image characteristics have been used as features:
linear and curvilinear segments, curvature extrema, curvature discontinuities, and
identifiable conics. In the context of computational biology, an example of a feature can
be a combination of amino acids with understood behavior and possibly known
3-dimensional structure. For instance, for a helix-turn-helix (HTH) motif that mediates
the binding of many regulatory proteins to regulatory control sites of DNA, the two
features are the two helices at the beginning (7 a.a.) and the end (9 a.a.) of the 20 a.a.
stretch that corresponds to an instance of the HTH motif. A property can be thought of as
an attribute of a feature: in the case of the HTH, a property would be the fact that the two
features (helices) are held together through non-polar interactions of their side chains. It
should be stressed that the concept of the feature is also intrinsically connected to the task
at hand. For example, for some applications, individual a.a. letters can be thought of as
"features."

What is important is that previously researchers had to (a) know
something about the set of sequences, or (b) align the exemplar sequences, or (c) perform
both (a) and (b) before they could determine those motifs that were peculiar to the
exemplar sequences and, thus, by extension specific to and characteristic of the family
defined by the exemplar sequences. The researchers knew and exploited properties of
sequences, knew and exploited features of the sequences, and/or aligned the sequences.
The present invention is unsupervised, meaning that no information about the exemplar
sequences need be known, and the present invention will still determine patterns that can
subsequently be used to define the family implied by the exemplar sequences as well as
analyze candidate sequences for inclusion into this family.

In an embodiment of the present invention, a training set of family
members is searched in an unsupervised manner to determine statistically significant,
common patterns between some or all of the family members. Each family member
comprises a sequence, which itself comprises a series of characters. The present
invention may be used on any sequence of symbols that can be described as a linear
stream of events, e.g., DNA (deoxyribonucleic acid), proteins, languages, and numbers.
Preferably, a predetermined sequence-support threshold will initially be set. This
predetermined sequence-support threshold determines how many of the sequences in the
family need to have a pattern for the pattern to be considered common to the training set.
For instance, if there are 100 sequences in the family, the predetermined
sequence-support threshold could be set to 50. This means that a pattern must be found
in 50 of the sequences for the pattern to be considered common to the family members in
the training set. Generally, this threshold is initially set to the number of sequences in the
training set. Should no common patterns be found, the sequence threshold may be
modified.

If common patterns are found, they are examined to determine if they are
statistically significant. Any remaining statistically significant patterns may be used to
describe the family members and, subsequently, to ascertain if candidate members are
part of the family. Preferably, the statistically significant and common patterns become
part of a composite descriptor. Once the statistically significant and common patterns are
found for a set (which could include all) of the family members, the sequences containing
the patterns are removed from the training set. This results in a smaller training set.

This modified training set is again searched for common patterns. The
sequence threshold may be modified to search for fewer sequences of the modified
training set or to search for all of the sequences in the training set. If any common and
statistically significant patterns are found, the composite descriptor is modified to add the
new patterns. This process preferably continues until either all sequences are removed
from the training set or until common patterns cannot be found between the remaining
sequences.
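
The iterative loop described above can be sketched in Python as follows. This is a
minimal sketch under loud assumptions: pattern discovery is replaced by a toy
search for fixed-length substrings (k-mers) shared by enough sequences, the
statistical significance test is a caller-supplied placeholder that accepts
everything by default, and the threshold is lowered one step at a time when
nothing is found. None of these stand-ins is the discovery algorithm or the
significance test of the invention, and all names are hypothetical.

```python
def common_kmers(sequences, support, k=3):
    """Toy discovery step: k-mers present in at least `support` sequences."""
    counts = {}
    for seq in sequences:
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer for kmer, n in counts.items() if n >= support}


def build_descriptor(training, b=2, min_support=2, significant=lambda p: True):
    """Accumulate patterns, removing covered sequences after each round."""
    remaining = list(training)
    descriptor = set()
    while len(remaining) >= min_support:
        support = max(min_support, len(remaining) // b)
        found = set()
        # Lower the sequence-support threshold until something is found.
        while support >= min_support and not found:
            found = {p for p in common_kmers(remaining, support) if significant(p)}
            support -= 1
        if not found:
            break  # no common patterns among the remaining sequences
        descriptor |= found
        remaining = [s for s in remaining if not any(p in s for p in found)]
    return descriptor, remaining


training_set = ["AAACDEAAA", "TTACDETT", "GGGWYFGG", "PPWYFPPP", "KLMNPQRS"]
descriptor, uncovered = build_descriptor(training_set)
```

In this toy run the descriptor collects patterns from two sub-groups of the
training set, and the one sequence that shares nothing with the others is left
uncovered.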

Once the composite descriptor is determined, the composite descriptor
may be used to determine if a candidate sequence is part of the family. In particular, the
composite descriptor may be used to search a database of sequences to determine if
individual sequences in the database are members of the family described by the
composite descriptor. Usually, a pattern-support threshold will be used to make this
determination. The pattern-support threshold determines the number of patterns that
must match between the candidate sequence and the patterns in the composite descriptor.
For example, if there are 1000 patterns in the composite descriptor, the pattern-support
threshold may require matches on 995 of the patterns for the candidate sequence to be
considered a member of the family. Moreover, after more members of the family are
found by using the current composite descriptor, these new members may be added to the
original training set to create a new training set. The composite descriptor method may
again be run on the new training set. This will provide even greater sensitivity and allow
the composite descriptor to "learn" new patterns common to the family.
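
Applying a composite descriptor with a pattern-support threshold can be sketched
as below. The patterns, the candidate sequence, and the function names are
hypothetical; patterns are written as Python regular expressions for convenience.

```python
import re


def count_matching_patterns(patterns, sequence):
    """Number of descriptor patterns found anywhere in the sequence."""
    return sum(1 for p in patterns if re.search(p, sequence))


def is_family_member(patterns, sequence, pattern_support):
    """Accept the candidate if enough descriptor patterns match it."""
    return count_matching_patterns(patterns, sequence) >= pattern_support


composite_descriptor = ["AC.E", "W[YF]F", "K..R"]   # hypothetical patterns
candidate = "TTACDEGGWYFTT"

n_matches = count_matching_patterns(composite_descriptor, candidate)
accepted = is_family_member(composite_descriptor, candidate, pattern_support=2)
```

Candidates accepted this way could then be appended to the training set before the
descriptor is rebuilt.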

While the present invention can determine statistically significant and
common patterns with aligned sequences, the present invention does not need aligned
sequences. To align two sequences, one or more patterns common to both sequences are
aligned in a left-to-right order. For example, assume that the pattern being aligned is
ABC. The sequence of characters {DEFXYZABC} would be aligned with {ABCDEF}
by either aligning the ABC patterns in a left-to-right manner or by aligning the DEF
patterns. Thus, when aligning the ABC patterns, the XYZ of the first sequence would not
be aligned with characters in the second sequence and the DEF of the second sequence would
not align with characters in the first sequence. For this example, there is no unique
alignment and it is easy to see how the situation can be complicated further as the number
of sequences to process increases. Because the present invention preferably searches for
patterns common to the sequences, the present invention would determine that ABC was
common to the two sequences, regardless of their alignment.
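
The two sequences of this example can be checked directly. The sketch below, with
illustrative names, lists the three-character substrings shared by both sequences
together with the offset at which each occurs; no alignment is computed.

```python
def kmers(sequence, k):
    """All substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


first, second = "DEFXYZABC", "ABCDEF"
shared = kmers(first, 3) & kmers(second, 3)

# Offset of each shared substring in (first, second).
offsets = {p: (first.find(p), second.find(p)) for p in sorted(shared)}
print(offsets)  # {'ABC': (6, 0), 'DEF': (0, 3)}
```

ABC and DEF are both found even though they occur in opposite orders in the two
sequences, which is why no single left-to-right alignment covers both.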

The present invention also does not need the availability of biological
information related to the family. While such information could be used, the present
invention will determine statistically significant and common patterns within the family
members without biological information. Moreover, because outliers are expected to not
contribute much in the way of statistically significant patterns to the composite descriptor,
outliers have less of an impact on the present invention.

Turning now to FIG. 1, FIG. 1 is a schematic block diagram showing the
architecture of an illustrative system 100 in accordance with the present invention.
System 100 may be embodied as a general purpose computing system, such as the general
purpose computing system shown in FIG. 1. System 100 includes a processor 110 and
related memory, such as a data storage device 120, which may be distributed or local.
The processor 110 may be embodied as a single processor or a number of local or
distributed processors operating in parallel. Such processors could communicate through
a common bus or through one or more networks. The data storage device 120 is operable
to store one or more instructions and data, which the processor 110 is operable to retrieve,
interpret, execute and use. Data storage device 120, in this example, comprises a
composite descriptor method 200, a composite descriptor 130, a training set 140, a
database 150, and discovered family members 160. Not all of these need be present at
any one time. In general, the composite descriptor method 200 will examine the training
set 140 for common and statistically significant patterns. Training set 140 comprises a
number of sequences, each of which comprises a series of symbols. Each symbol comes
from a collection of possible symbols referred to as an alphabet. The alphabet could
describe such entities as DNA (deoxyribonucleic acid) or proteins. The composite
descriptor 130 will be modified to add any common and statistically significant patterns
that are found. Database 150 contains a number of candidate sequences. Once a
composite descriptor 130 is created, the composite descriptor may be used to determine
which, if any, of the candidate sequences in the database 150 are part of the family of
sequences described by composite descriptor 130. If any candidate sequences are
determined to belong to the family, these candidate sequences may be stored in the
discovered family members area 160. If desired, the discovered family members 160
may be added to the training set 140 to create a new training set 140. Composite
descriptor method 200 may then act on this new training set 140 to further refine
composite descriptor 130.
As is known in the art, composite descriptor method 200 may be
distributed as an article of manufacture that itself comprises a computer readable medium
having computer readable code means embodied thereon. The computer readable
program code means is operable, in conjunction with a computer system such as
computer system 100, to carry out all or some of the steps to perform the composite
descriptor method 200. The computer readable medium may be a recordable medium
(e.g., floppy disks, hard drives, Compact Disks, or memory sticks), or may be a
transmission medium (e.g., a network comprising fiber-optics, the world-wide web,
cables, or a wireless channel using time-division multiple access, code-division multiple
access, or other radio-frequency channel). Any medium known or developed that can
store information may be used.

Composite descriptor method 200, as shown in FIG. 2, performs
unsupervised building of composite descriptors and then exploits the determined
composite descriptors to find additional family members of the families described by the
composite descriptors. Method 200 is performed whenever it is desired that a composite
descriptor be determined and used. It should be noted that method 200 may be broken
into multiple sections. Preferably, the steps up to step 280 would be used to determine a
composite descriptor from a training set, while optional step 260 would be used to apply
the composite descriptor to one or more candidate sequences, and optional steps 270 and
275 would be used to further refine the composite descriptor.

Method 200 begins in step 205 when a training set is provided. It should
be noted that the steps are not necessarily performed in the order shown. The training set, T, is
preferably N unaligned sequences S_i for which there is reason to believe that the
sequences are related. There should exist identifiable local similarities among members
of T at the amino acid level, although it is assumed that no other information is available
for the members of T, e.g., known or identifiable secondary structures, known or
identifiable domains, functional information, physico-chemical properties, or physical
properties. If no identifiable local similarities exist among members of T, method 200
will not provide a suitable composite descriptor for the family, as a composite descriptor
does not exist for the family.

Each sequence is a series of symbols from an alphabet. For proteins, one
can denote by Σ the alphabet of all amino acids; i.e., Σ = {A, C, D, E, F, G, H, I, K, L, M,
N, P, Q, R, S, T, V, W, Y}. On this alphabet, regular expressions can be defined that can
range from very simple n-grams to more general ones containing wild cards and capturing
strings of variable length. The '.' (referred to as the "don't care character") is used to
denote a position in a sequence or pattern that can be occupied by an arbitrary residue. A
bracket is meant to denote a "one of" choice; i.e., [KR] means that the position this
bracket corresponds to can be occupied by exactly one of K or R. A bracket can have a
minimum of 2 alphabet characters but not more than |Σ| - 1.
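
The pattern notation just described (single residues, a don't care character, and
bracketed "one of" choices) can be parsed and matched with a short sketch. It
assumes the don't care character is written as '.', and the function names are
hypothetical.

```python
import re

SIGMA = set("ACDEFGHIKLMNPQRSTVWY")   # the 20-letter amino acid alphabet


def parse_pattern(pattern):
    """Turn a pattern into a list of allowed-residue sets, one per position."""
    positions = []
    for token in re.findall(r"\[[^\]]*\]|.", pattern):
        if token.startswith("["):
            choices = set(token[1:-1])
            # A bracket holds between 2 and |SIGMA| - 1 alphabet characters.
            if not (2 <= len(choices) <= len(SIGMA) - 1) or not choices <= SIGMA:
                raise ValueError(f"invalid bracket: {token}")
            positions.append(choices)
        elif token == ".":
            positions.append(SIGMA)          # don't care: any residue
        elif token in SIGMA:
            positions.append({token})
        else:
            raise ValueError(f"invalid symbol: {token}")
    return positions


def matches_at(positions, sequence, offset):
    """True if the parsed pattern matches the sequence starting at offset."""
    window = sequence[offset:offset + len(positions)]
    return len(window) == len(positions) and all(
        residue in allowed for residue, allowed in zip(window, positions)
    )


parsed = parse_pattern("A.[KR]C")
```

For example, `matches_at(parsed, "GGAWKCGG", 2)` is true because A, any residue,
K, and C appear in that order starting at offset 2.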

In step 210, the sequence threshold, K, is set. It is possible to set K=|T|,
which is the number of sequences in the training set. In actuality, it has proven beneficial
to assign a small starting value to K that is a fraction of the number of sequences in T.
Experiments have shown that a starting value of K=|T|/b with b=4 or 5 is a good choice
across many data sets. Note that the smaller the value of b, the higher the redundancy of
the composite descriptor will be. The selection of K also can depend on how conserved,
or similar, the family members are. If the family members are well conserved, then K can
be higher; if the family members are not well conserved, then K can be lower.
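
The starting value K=|T|/b can be computed as in the sketch below. Integer
division and the floor of 2 sequences are assumptions made for this illustration;
the floor mirrors the minimum limit of 2 that the description uses for K in its
FIG. 2 example.

```python
def initial_sequence_threshold(num_sequences, b=4, minimum=2):
    """Starting sequence threshold K = |T| / b, never below `minimum`."""
    return max(minimum, num_sequences // b)


k_for_1000 = initial_sequence_threshold(1000)        # b = 4
k_for_1000_b5 = initial_sequence_threshold(1000, b=5)
```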

In step 215, a set of maximal patterns in the K sequences is determined. In
general, this step tries to determine common patterns between the K sequences. Not only
should the patterns be common, but they should also be as large as possible. These large
patterns may further be mathematically defined as "maximal" in a way described below.
Any of the available algorithms which can guarantee that all sought patterns are
discovered and that they are maximal can be used here. For the experiments related
below, a Teiresias algorithm was used. This algorithm is described in Floratos, et al.,
U.S. Patent No. 6,108,666, "Method and Apparatus for Pattern Discovery in
1-Dimensional Systems"; Floratos, et al., U.S. Patent No. 6,092,065, "Method and
Apparatus for Discovery, Clustering and Classification of Patterns in 1-Dimensional
Event Streams"; Rigoutsos, I. and A. Floratos, "Combinatorial Pattern Discovery in
Biological Sequences: the Teiresias Algorithm," Bioinformatics, 14(1):55-67, 1998; and
Rigoutsos, I. and A. Floratos, "Motif Discovery Without Alignment Or Enumeration,"
Proceedings 2nd Annual ACM International Conference on Computational Molecular
Biology, New York, NY, March 1998, the disclosures of which are incorporated by
reference herein.

A short introduction to this method follows. A pattern S is a regular
expression on Σ that defines a language G(S). The elements of the language are all the
strings that can be obtained from the regular expression that S stands for. A protein is
said to match a given pattern S if and only if it contains at least one substring (i.e., a
block of consecutive residues) that belongs in G(S). A pattern S' is said to be more
specific than a pattern S if G(S') ⊆ G(S). Given a pattern S and a database D, the offset list
of S with respect to D (or simply the offset list of S, when the
database D is unambiguously implied) is defined to be the following set: L_D(S) = {(i, j) | the i-th
sequence of the database D matches the pattern S at offset j}. A pattern S is called
maximal with respect to a database D if there exists no pattern S' which is more specific
than S and such that |L_D(S)| = |L_D(S')|. A maximal pattern cannot be made more specific
without simultaneously reducing the cardinality of its offset list. A pattern S is called an
<L,W> pattern (with L<W) if every substring of S with length W contains L or more
non-don't care positions. Note that a given choice for the parameters L and W has a
direct bearing on the degree of remaining similarity among the instances of the domain
that is captured by the regular expression: the smaller the value of the ratio L / W, the
lower the degree of sought similarity, because more don't care positions are permitted
within each window.
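
The offset list and the <L,W> condition defined above can be illustrated with the
following sketch. It assumes patterns are written with '.' as the don't care
character so that they can be handed to Python's `re` module unchanged, uses
0-based sequence indices and offsets, and applies the <L,W> condition literally as
stated (so a pattern shorter than W passes vacuously). Function names are
hypothetical.

```python
import re


def offset_list(pattern, database):
    """Set of (i, j): sequence i of the database matches the pattern at offset j."""
    # A lookahead lets finditer report overlapping occurrences as well.
    regex = re.compile(f"(?={pattern})")
    return {
        (i, match.start())
        for i, sequence in enumerate(database)
        for match in regex.finditer(sequence)
    }


def is_lw_pattern(pattern, L, W):
    """True if every length-W window has at least L non-don't-care positions."""
    tokens = re.findall(r"\[[^\]]*\]|.", pattern)
    fixed = [token != "." for token in tokens]
    return all(sum(fixed[i:i + W]) >= L for i in range(len(fixed) - W + 1))


offsets = offset_list("A.C", ["AACAGC", "TTAXC"])

# "AGC" is more specific than "A.C"; on this database both have offset
# lists of the same size, so "A.C" is not maximal here.
db = ["XAGCX", "YAGCY"]
same_size = len(offset_list("A.C", db)) == len(offset_list("AGC", db))
```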

The Teiresias algorithm is a pattern discovery algorithm that can guarantee
the discovery of all <L,W> patterns that are maximal and supported by K or more input
sequences. The pattern discovery is carried out while allowing the symbols of Σ to be
partitioned in equivalence classes. Any symbol within a given class is able to replace any
other symbol of the (same) class. One such example would be the partition: {A, G}, C,
{D, E}, {F, Y}, H, {I, L, M, V}, {K, R}, {N, Q}, P, {S, T}, W. In fact, the various
symbol classes do not have to form a partition of Σ. In other words, a given symbol can
belong to more than one class. One such set of classes can be obtained by using a
distance threshold with any of the currently available scoring matrices such as the PAM
and BLOSUM series. PAM is described in Dayhoff, "Atlas of Protein Sequence and
Structure," vol. 5, National Biomedical Research Foundation, 1978; and BLOSUM is
described in Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," Proc.
Natl. Acad. Sci. USA, 89:10915-10919, 1992, the disclosures of which are
incorporated by reference.
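
The example partition can be used to compare strings up to equivalence, as in the
sketch below. It handles only a true partition, where every symbol belongs to
exactly one class; overlapping classes would need a set of classes per symbol
instead of a single index. The names are illustrative.

```python
# The example partition of the amino acid alphabet given in the text.
CLASSES = ["AG", "C", "DE", "FY", "H", "ILMV", "KR", "NQ", "P", "ST", "W"]
CLASS_OF = {aa: index for index, group in enumerate(CLASSES) for aa in group}


def reduce_to_classes(sequence):
    """Replace each residue by the index of its equivalence class."""
    return tuple(CLASS_OF[aa] for aa in sequence)


def equivalent(a, b):
    """True if the strings agree position by position up to class membership."""
    return len(a) == len(b) and reduce_to_classes(a) == reduce_to_classes(b)


same = equivalent("AKDF", "GREY")     # A~G, K~R, D~E, F~Y
different = equivalent("AKDF", "GRCY")  # D and C are in different classes
```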

The Teiresias algorithm permits the discovery of all <L,W> patterns that
are maximal and supported by K or more input sequences, in the presence of stated
equivalences involving symbols from the input alphabet. Each pattern S that the
Teiresias algorithm will discover is of the form:

(Σ ∪ [ΣΣ*Σ]) (Σ ∪ [ΣΣ*Σ] ∪ '.')* (Σ ∪ [ΣΣ*Σ]).

Associated with each pattern S is the sensitivity of the pattern, which is
directly related to the number of sequences in D that contain S. The sensitivity is a
measure of how many members of the training set T do not match S (= false negatives).
Also associated with S is the pattern's specificity, which is a direct measure of how many
members of the database D match the pattern, but are not true members of the collection
that the training set T represents (= false positives). The choice of the values for the
parameters L and W is a function of the collection under consideration. Experimental
work has shown that a choice supporting a moderate degree of local similarity (e.g.,
~40-50%) is a good choice across a very large variety of test cases.

In step 225, it is determined if any patterns are found. If no patterns are
found (step 225 = NO), the sequence-support threshold, K, can be decreased. Preferably, this is
done by setting K=|T|/b, where b is usually set to 4 or 5. It is also possible to set b to
smaller values, such as 2 or 3. Setting b to smaller values increases the amount of
processing time it might take to determine maximal patterns. For instance, if there are
1000 sequences in T and K = |T| = 1000, and no common maximal patterns are found, it
is likely that changing K to 999 will also not find any common maximal
patterns. Changing K from 1000 to 250, however, will make it more likely that common
maximal patterns may be found. After K has been changed (step 230), it is determined if
K meets a predetermined minimum limit. This limit has been set, in the example of FIG.
2, as 2. If there are two (or even more) sequences that have a pattern, even a maximal
pattern, in common, this pattern may not be representative of the family members. In step
220, other minimum sequence-support thresholds may be used, if desired. The choice of
the predetermined minimum limit is not critical, as outliers (those sequences that are at the
"edge" of the family or even not part of the family) are expected to have little or no
bearing on the composite descriptor of the present invention. This is discussed in more
detail below, in reference to step 260.
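
The relaxation of the support threshold can be sketched as a simple schedule. The function name support_schedule is hypothetical, and the use of integer division is an assumption, since the text does not state how a fractional K is rounded.

```python
def support_schedule(n_sequences, b=4, minimum=2):
    """Yield successive support thresholds K, starting at |T| and dividing by b."""
    k = n_sequences
    while True:
        yield k
        if k <= minimum:
            return
        # Never drop below the predetermined minimum limit.
        k = max(minimum, k // b)
```

With 1000 training sequences and b = 4, the schedule visits far fewer thresholds than with b = 2, which reflects the processing-time trade-off described above.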

If maximal patterns are found in step 215 (step 225 = YES), in step 235, it
is determined if the maximal patterns are statistically significant. In general, in step 235,
it is determined, for each maximal pattern, what the probability is that the maximal
pattern occurs in a sequence. This probability should meet a predetermined threshold.
This step is important because the patterns will be exploited, as part of the composite
descriptor, to determine additional family members. If relatively general patterns are
used, the patterns could admit candidate members into a family when the candidate
members are not members of the family. For instance, for the English language, the
pattern "the" is much more likely to appear in a sentence than is the pattern "quit." The
pattern "the" would be much more likely to include candidate members as part of the
family than would the pattern "quit." This would be appropriate if the family was defined
as any sentence having the pattern "the." However, a much more likely occurrence is to
define the family as any sentence having the pattern "quit," and if the pattern "the" is
used as part of a composite descriptor, it is possible that this pattern will generate too
many false family members.

From the set of maximal <L,W> patterns that are discovered, the set M_s is
selected that contains only those that are statistically significant. With appropriate
modifications, any of several published methods can be used at this step, the disclosures
of which are herein incorporated by reference: Atteson, "Calculating the Exact
Probability of Language-like Patterns in Biomolecular Sequences," Proceedings of the
Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB '98),
Menlo Park, California, AAAI Press, 1998; Jonassen, Collins, and Higgins, "Finding
Flexible Patterns in Unaligned Protein Sequences," Protein Science, pp. 1587-1595,
1995; Nicodeme, Salvy and Flajolet, "Motif Statistics," INRIA Technical Report No.
3606, January 1999; Pevzner, Borodovsky and Mironov, "Linguistics of Nucleotide
Sequences: the Significance of Deviation from Mean Statistical Characteristics and
Prediction of the Frequencies of Occurrences of Words," Journal of Biomolecular
Structure Dyn., 6:1013-1026, 1989; Regnier, "A Unified Approach to Word Statistics,"
Proceedings 2nd Annual ACM International Conference on Computational Molecular
Biology, New York, NY, March 1998; Sagot and Viari, "A Double Combinatorial
Approach to Discovering Patterns in Biological Sequences," Proceedings of the Seventh
Symposium on Combinatorial Pattern Matching, pp. 186-208, 1996; Sewell and Durbin,
"Method for Calculation of Probability of Matching a Bounded Regular Expression in a
Random Data String," Journal of Computational Biology, 2(1):25-31, 1995; and
Wootton, "Evaluating the Effectiveness of Sequence Analysis Algorithms Using Measures
of Relevant Information," Computers Chem., 21(4):191-202, 1997.

For simplicity, the probabilities of the discovered patterns, as disclosed in
the Examples section below, were determined with the help of a 2nd order Markov chain
method, as described in Salzberg, Delcher, Kasif, and White, "Microbial gene
identification using interpolated Markov models," Nucleic Acids Res., 26(2):544-8, 1998,
which is incorporated herein by reference. The natural logarithm of the estimated
probability was used as the measure of a pattern's significance. The threshold against
which this measure is compared can be estimated as a function of the size of the database
to be searched with the composite descriptor.
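
One way to estimate such a log-probability is sketched below in Python. This is a simplified stand-in and not the interpolated Markov model of the cited reference: it uses a plain 2nd order chain with add-one smoothing, and it sums over all symbols allowed at each pattern position. The names train_markov2 and log_probability, and the representation of a pattern as a list of symbol sets, are assumptions made for illustration.

```python
import math
from collections import Counter

def train_markov2(sequences, alphabet):
    """Count 1-, 2- and 3-grams and return an add-one-smoothed 2nd order chain."""
    counts = Counter()
    for seq in sequences:
        for n in (1, 2, 3):
            for i in range(len(seq) - n + 1):
                counts[seq[i:i + n]] += 1
    size = len(alphabet)

    def prob(symbol, context=""):
        """P(symbol | up to two preceding symbols)."""
        seen = sum(counts[context + a] for a in alphabet)
        return (counts[context + symbol] + 1) / (seen + size)

    return prob

def log_probability(positions, prob):
    """Natural log of the probability that a pattern matches at a fixed offset.

    positions holds one set of allowed symbols per pattern position;
    a don't-care position is represented by the whole alphabet.
    """
    # state maps the last (up to two) symbols to the accumulated probability
    state = {"": 1.0}
    for allowed in positions:
        nxt = {}
        for context, p in state.items():
            for symbol in allowed:
                key = (context + symbol)[-2:]
                nxt[key] = nxt.get(key, 0.0) + p * prob(symbol, context)
        state = nxt
    return math.log(sum(state.values()))
```

A pattern would then be retained if its log-probability is at or below the chosen threshold; more specific patterns receive more negative values.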

The cardinality of the sub-selected set M_s of patterns ought to be high
because of the redundancy of sequence segments from T that are captured by the patterns.
This will guarantee a strong signal-to-noise ratio when the composite descriptor is used as
a predicate. It is worth pointing out that even if the training set has just a few
members, the cardinality of M_s (and thus the redundancy) can be high since there is a
multitude of patterns that one can generate even from a few sequences.

Once the statistically significant patterns are found, the sequences that match these
patterns are removed from the training set, T, of sequences. This occurs in step 240. It should be
noted that steps 240, 245 and 250 do not have to occur in this order and could even occur
in parallel. Preferably, each sequence of the training set is examined to determine
whether it matches any of the significant patterns of M_s. After all patterns of M_s have
been exhausted, all sequences that matched one or more patterns are added to a temporary
set A. Upon completion of the iteration, one or more sequences from T will have been
entered into the set A; these are essentially the sequences that have been accounted for.
What remains of T after the removal of these sequences, i.e., T \ A, is used as the training
set for the next iteration. Thus, the training set T is modified (step 245), which could
include marking which sequences in an array of sequences are no longer valid, or copying
the remaining sequences into a new array.

In step 250, the composite descriptor is modified. Preferably, the new
composite descriptor is the union of the current composite descriptor and the set M_s. That is, the set of
significant patterns M_s which was discovered during this last iteration is added to the
composite descriptor by adding those patterns in M_s that are currently not in the
composite descriptor.

In step 255, it is determined if the training set, T, is empty. If the training
set is not empty (step 255 = NO), the method continues in step 215 and repeats. If the
training set is empty (step 255 = YES), and after step 220 = YES, the method ends in step
280. Optionally, steps 260, 270 and 275 may be performed at this point.

At the end of this stage, the composite descriptor contains a set of patterns
that by design are specific and sensitive for the collection that the training set T
represents. Several properties distinguish this composite descriptor from previous
collections of patterns, such as the Prints database of patterns. For example, the building
of the composite descriptor is automatic: it does not require manual intervention and does
not necessitate the computation of multiple alignments. Additionally, there is no need for
biological knowledge specific to the training set T that will impose helpful constraints
during generation of the composite descriptor. Also, highly similar sequences need not
be removed from the training set prior to the building of the composite descriptor.
Additionally, as discussed below in reference to step 260, the training set can safely
contain a small percentage of potential outliers, i.e., sequences that have questionable
membership in the collection that the training set represents. Because of the redundant,
iterative nature of the building phase, the resulting composite descriptor is not expected to
contain any statistically significant patterns that are shared by both the outliers and the
rest of the sequences in T. Through the initial selection of the support value (small K)
the composite descriptor can be made sensitive and contain patterns that are specific for
the set T (i.e., large probability threshold, Thr_prob). Finally, the fact that the composite
descriptor contains all those patterns which are specific, significant, and which by design
account for every member of the training set, guarantees a strong signal-to-noise ratio
when using the composite descriptor as a multi-valued predicate (which takes place in step
260). Steps 205 through 255 may be expressed in pseudo-code as follows.



i)   CompDescr ← ∅

ii)  K ← |T| (or K ← max(2, |T|/b); see also text)

iii) discover the set M of all <L,W> maximal patterns in T
     that are supported by at least K sequences of T

iv)  if (|M| = 0) then
         if (K = 2) terminate;
         set K ← K - 1 (or K ← max(2, K/b))
         continue with step iii)
     end-if
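
The building phase of steps 205 through 255, as described in the text, might be sketched in Python as follows. This is an illustration under stated assumptions and not the pseudo-code of the method itself: the pattern discoverer and the significance test are supplied by the caller, toy_discover is a deliberately simple stand-in for Teiresias that only finds exact substrings, K is relaxed whenever no sequence could be accounted for, and all function names are hypothetical.

```python
import re

def toy_discover(seqs, k, width=3):
    """Toy stand-in for pattern discovery: substrings found in >= k sequences."""
    counts = {}
    for s in seqs:
        for kmer in {s[i:i + width] for i in range(len(s) - width + 1)}:
            counts[kmer] = counts.get(kmer, 0) + 1
    return [p for p, c in counts.items() if c >= k]

def build_composite_descriptor(training, discover, is_significant, b=4):
    """Collect significant patterns until the training set is accounted for.

    discover(seqs, k) -> patterns supported by at least k of seqs (assumed).
    is_significant(p) -> True if p passes the significance filter (assumed).
    Returns the descriptor and the sequences that were never accounted for.
    """
    remaining = list(training)
    descriptor = set()                        # CompDescr <- empty set
    k = max(2, len(remaining))                # K <- |T|
    while remaining:
        significant = {p for p in discover(remaining, k) if is_significant(p)}
        covered = {s for s in remaining
                   if any(re.search(p, s) for p in significant)}
        if not covered:
            if k <= 2:
                break                         # minimum support reached
            k = max(2, k // b)                # relax the support threshold
            continue
        descriptor |= significant             # union with the new patterns
        remaining = [s for s in remaining if s not in covered]  # T <- T \ A
    return descriptor, remaining
```

In this sketch an unrelated sequence that shares no pattern with at least one other sequence is simply left over, so it contributes nothing to the descriptor.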

In step 260, the composite descriptor is exploited to determine if candidate
sequences are members of the family described by the composite descriptor. Generally, a
database of sequences (such as database 150 of FIG. 1) will be searched, but individual
sequences may also be compared against the composite descriptor. The composite
descriptor will be a number of common, statistically significant and maximal patterns that
describe the family. As such, the composite descriptor acts much like a dictionary to
describe the family. It can be used in step 260 to determine additional members of the
family.

Because method 200 relies on searching through a family and determining
the common, statistically significant and maximal patterns that compose the composite
descriptor, outliers tend not to matter as much for the present invention. An outlier is a
sequence that has been erroneously included within the family. Some simple examples
will help to explain why outliers are not a hindrance to the present invention.

Assume that there are 100 members of the family; assume also that 93
members of the family are accounted for but there are 7 outliers that were erroneously
included as members of the family. Since, by definition, the latter set comprises the
outliers, it is generally true that the number of patterns that will be shared among them
and the remaining 93 sequences should be very small (if not 0) when compared to the
number of patterns that will be shared by the 93 truly related sequences. This will thus
generate very small (if any) support for sequences that are not true members of the family
being studied. Moreover, these erroneous patterns will be further filtered out through the
statistical significance filtering stage. Finally, when the composite descriptor, which
contains patterns common to all 100 sequences, is used to determine if a new sequence is
part of the family, the composite descriptor will be used with a pattern-support threshold.
In other words, there will be some minimum number of patterns that the new sequence
must have in order to be considered part of the family. This threshold will usually be
high enough such that outliers, even if they contribute patterns, will not cause non-family
members to be included within the family.

In step 260, the composite descriptor can be used as a multi-valued
predicate that can determine the membership of a query sequence in the collection that the
original training set T defines. The composite descriptor can be used to examine a
candidate-for-membership-in-T sequence S_cand for instances of the permitted patterns.
Given S_cand, as many local counters as the length of the sequence may be allocated and
initialized to 0. A global counter for the sequence may also be allocated and also
initialized to 0. If it is determined that a segment of the query matches a pattern m, the
local counters at the sequence positions matching the pattern are incremented by an
amount equal to d. The possible choices for d include among others "the number of
occurrences o_m of m in T" and "the number 1." The former choice favors segments that
match patterns supported by a lot of sequences in T whereas the latter gives
comparatively increased support to segments that are only moderately conserved. The
choice for the amount d by which to increment the local counters modifies the semantics
of the predicate's output value.

If the value of d is set to 1, then the predicate is a measure of how many
distinct patterns generated from T are matched by the query sequence. In this case, large
values indicate that the result is corroborated by multiple patterns which are specific for
the collection T. Smaller values are at the very minimum indicative of the existence of
local similarities that are shared by the query and one or more members of the training set
T. Such similarities can imply one of two things: either the query is a true but distant
member of the collection under consideration or it is not a true member but it nonetheless
shares one or more regions of similarity with members of the collection.

If the value of d is set to 'the number of occurrences' of the respective
pattern in the training set T, the predicate is a measure of how many distinct sequence
fragments in T are similar to the respective query fragment. Large values indicate regions
that are shared by a large number of sequences in T and can be indicative of a conserved
active site, for example. Both choices of d have merit and the one to use depends on the
task at hand.

Independent of what the choice for d is, every time a segment of S_cand
matches a pattern m, the global counter associated with S_cand is incremented by d. After
all segments of S_cand have been examined, the value of the global counter is inspected; if
it exceeds Thres_rand, S_cand is reported as a candidate for membership in the collection
defined by T.
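
The counter bookkeeping of step 260 can be sketched as follows. The sketch assumes that patterns are written with '.' for a don't-care position and bracketed classes, so that they can be handed directly to a regular-expression engine, and that every position spanned by a match (including don't-care positions) has its local counter incremented. The names score_candidate and is_member are hypothetical.

```python
import re

def score_candidate(candidate, patterns, occurrences=None):
    """Return (local_counters, global_counter) for one candidate sequence.

    occurrences: optional {pattern: o_m}; if omitted, d = 1 for every match.
    """
    local = [0] * len(candidate)
    total = 0
    for pattern in patterns:
        d = 1 if occurrences is None else occurrences[pattern]
        # Number of sequence positions that one instance of the pattern spans.
        width = len(re.findall(r"\[[^\]]+\]|.", pattern))
        # The lookahead reports overlapping instances as well.
        for m in re.finditer("(?=" + pattern + ")", candidate):
            total += d
            for pos in range(m.start(), m.start() + width):
                local[pos] += d
    return local, total

def is_member(candidate, patterns, thres_rand, occurrences=None):
    """Report the candidate if its global counter exceeds the threshold."""
    _, total = score_candidate(candidate, patterns, occurrences)
    return total > thres_rand
```

The local counters indicate which regions of the candidate attract support, while the global counter alone decides membership.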

The value of Thres_rand depends on the actual contents of the composite descriptor and
can be determined as follows: beginning with the composite descriptor that was built
from the training set T, one can scan, as outlined above, a randomized version of a very
large database such as GenPept or Swiss-Prot. Essentially, each sequence of such a
database is treated as a potential query. Upon completion of the scanning process, one
can accumulate support for all the sequences that matched one or more patterns of the
composite descriptor and histogram the support values to obtain their distribution. The
value of Thres_rand may be determined by identifying the q-th percentile of this
distribution. Typically, q is set to 95 or higher.
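
A minimal sketch of this threshold estimation is given below. It assumes that each database sequence is randomized by shuffling its own symbols, that only non-zero support values enter the distribution, and that the percentile is taken by the nearest-rank rule; the function names and the caller-supplied score function are hypothetical.

```python
import random

def shuffle_sequence(seq, rng):
    """Randomly permute a sequence, preserving its length and composition."""
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)

def percentile_threshold(scores, q=95):
    """Nearest-rank q-th percentile of the non-zero support values."""
    ranked = sorted(s for s in scores if s > 0)
    if not ranked:
        return 0
    rank = max(1, -(-q * len(ranked) // 100))   # ceil(q * n / 100)
    return ranked[rank - 1]

def estimate_thres_rand(database, score, q=95, seed=0):
    """Score a shuffled copy of every sequence and take the q-th percentile.

    score(seq) -> global counter of seq against the composite descriptor.
    """
    rng = random.Random(seed)
    return percentile_threshold(
        [score(shuffle_sequence(seq, rng)) for seq in database], q)
```

Fixing the seed only makes the sketch repeatable; in practice the permutation would be chosen at random.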

After step 260 has been performed, it is possible to take the new members
found and add them to a new training set that comprises the old training set and the new
members. Then steps 205 through 280 may be run again (step 275) to further refine the
composite descriptor for this family. Thus, the present invention allows learning to be
performed, if this is desired.

The present method does not suffer from drawbacks related to (a) the need
for good multiple sequence alignments, (b) the inclusion of outliers, (c) the inherent
dependence of the results on the selection of the scoring matrix that is used, and (d)
overtraining. Indeed, building of the composite descriptor does not require the
computation of any multiple sequence alignments, whereas the redundancy of
representation that is inherent in the composite descriptor is expected to more than
counterbalance the inclusion of any small number of outliers. Additionally, this will
prevent the system from including even more outliers during the following iteration.
Moreover, after each iteration, only the sequence fragments whose support exceeds the
threshold are considered, thus allowing the process to remain 'focused' on what has been
deemed important and relevant for the dataset under consideration.

Finally, it should be noted that the training set T, which is given at the
very beginning of this iterative process, impacts the quality of the results (i.e.,
sensitivity and specificity) that the method will produce. For example, if the original
training set is not sufficiently representative of all instances of a family's members (e.g.,
GPCRs), or of the construct of interest (e.g., the helix-turn-helix DNA binding motif), the
generated composite descriptor should not be expected to discover all instances relating
to the training set. This last observation holds true for all methods that try to build single
or composite descriptors by starting with a training set T. Since the augmented training
sets at the beginning of the (i+1)-st iteration preferably only comprise the sequence
fragments which exceeded the threshold during the i-th iteration, the composite descriptor
will maintain its 'focus' on what is essentially dictated by the original training set. That
is not to say that the composite descriptor will not be sensitive; on the contrary, the
composite descriptor will be sensitive to the extent that the processed data permit while at
the same time remaining in lock-step, so to speak, with the originally provided training
input. As a matter of fact, the experimental results discussed below on three specific
datasets demonstrate that even starting with small training sets allows discovery of a large
number of representatives of the same group.

EXAMPLES 

Now that the method and apparatus have been described, some exemplary
results are shown in this section. In this section, results are described from the building
and use of composite descriptors for three distinct collections of data. The collections
were chosen in such a way so as to showcase the ability of the present invention to handle
input sets across a variety of contexts.

The first collection comprises sequences from PROSITE entry PS50040 of 
elongation factor 1 gamma chain sequences; in Release 15.0 of the PROSITE database, 
only a matrix profile is available for this collection.

The second collection comprises complete sequences as well as fragments 
of G protein-coupled receptors, a very important and diverse family of proteins that has 
traditionally been used as a benchmark test for gauging the quality of pattern-based 
approaches. 

Finally, the third collection comprises sequence fragments that are known
to contain an instance of the helix-turn-helix DNA binding motif, a structural motif of
great importance.

First, the composite descriptors were built for each of the three collections
and evaluated by treating the sequences in Swiss-Prot Release 38.0 as candidates for
membership in each of the respective three collections.

Once the behavior of the descriptors was characterized in the context of
Swiss-Prot, the 19,099 ORFs in the complete genome of Caenorhabditis
elegans were searched; these results are reported below.

Before proceeding, here are some methodological details and parameter
choices that are common to all three cases. In particular, the value d, by which the
counters are incremented, is set to 1, essentially favoring those sequences that contain
more instances of distinct patterns over others. The value of Thr_prob is determined by
assuming that the patterns ought to be able to discriminate among sequences in a database
as large as GenPept; for a database of this size an estimated log-probability of
-25 or less ought to suffice. Nevertheless, the more stringent threshold of Thr_prob = -30 was used
with the understanding that this will result in a sacrifice in sensitivity. But as the results
will demonstrate, even with this stringent threshold, the redundancy of each composite
descriptor leads to a sensitivity that is satisfactory. Also, in all three cases the following
amino acid (a.a.) equivalences are assumed: {A, G}, C, {D, E}, {F, Y}, H, {I, L, M, V},
{K, R}, {N, Q}, P, {S, T}, W.



The First Example: EF1G / PS50040

An application of the above described methodology is in the context of the
PROSITE database. Although numerous entries in PROSITE contain succinct and
specific patterns capturing most or all of the members of the corresponding collection,
there exist entries for which only a profile/matrix is available: PS50040, the family of
elongation factor 1 gamma chain proteins, is one such example, thus making it an ideal
candidate for processing with the described method.

PS50040 comprises 10 full sequences (EF1G_ARTSA, EF1G_CAEEL,
EF1G_HUMAN, EF1G_RABIT, EF1G_SCHPO, EF1G_TRYCR, EF1G_XENLA,
EF1G_YEAST, EF1H_XENLA, EF1H_YEAST) and 1 fragment (EF1G_PIG). The
reported profile matrix captures all 10 full sequences, misses the one fragment and
generates no false positives when the target database is Swiss-Prot Rel. 38.0.

It should be noted here that if one relaxes the constraints imposed by the
chemical equivalence classes shown above, it is possible to discover a specific pattern
that belongs to all 11 members of PS50040 and generates no false positives when used in
conjunction with Swiss-Prot Rel. 38.0. In fact, this pattern is



[ILMV]..[NW][ILMV]^^ [AG] 

and can be used to describe and capture elongation factor 1 gamma chain proteins; the 
deviations from the above chemical equivalence classes are shown in boldface. 

The composite descriptor was built for this collection by setting the
Teiresias parameters to L=5 and W=10; since the dataset is small there was only a single
iteration over the dataset with a threshold choice of K=6. In other words, the composite
descriptor was built by discovering patterns that involved a minimum of 5 non-wild cards
in any rolling window that spans 10 positions and begins/ends with a literal, i.e., a relatively
high degree of local similarity (50% or higher). Those patterns whose estimated
log-probability was equal to -30.0 or less were selected and this generated a composite
descriptor that comprised 2,260 patterns.

First, a corresponding DFA (deterministic finite automaton, which will
only recognize instances of the composite descriptor patterns in a query sequence and
which performs method step 260) was used to search a randomized version,
RAND-Swiss-Prot, of Swiss-Prot (Release 38.0) that was obtained by applying a
randomly chosen permutation to the amino acids of each of the valid sequences. Both the
composition and lengths of individual sequences were maintained by this operation. The
global counter for each randomized sequence was derived by summing up the local
counters from each sequence region that received non-zero support. The sequences were
then sorted in order of decreasing global-counter value. Twenty-seven (27) randomized
sequences received non-zero support with global counter values that ranged between 1
and 2 inclusive. Thres_rand was thus set to 3, and the DFA was subsequently used to search
the actual Swiss-Prot database. Of the 69 sequences that received non-zero support, only
16 exceeded the predefined threshold. The support values for the 16 sequences were:
EF1G_HUMAN 861, EF1G_RABIT 846, EF1G_XENLA 791, EF1H_XENLA 765,
EF1G_ARTSA 349, EF1G_CAEEL 228, EF1G_YEAST 110, EF1G_SCHPO 110,
EF1G_PIG 96, EF1H_YEAST 94, EF1G_TRYCR 88, SYV_FUGRU 7, GTT1_RAT 5,
GTT1_MOUSE 5, SYEP_HUMAN 3 and GTH4_MAIZE 3.

Note that the 5 hits SYV_FUGRU, GTT1_RAT, GTT1_MOUSE,
SYEP_HUMAN and GTH4_MAIZE are clearly separated from the 11 top scoring
sequences. They did, however, obtain scores which were above threshold and thus are
studied in more detail. In all 5 cases, one or more sizeable regions that were shared with
one or more members of the PS50040 collection were discovered. The Clustal-W
alignment of EF1G_XENLA and the N-terminus of SYV_FUGRU, a valyl-tRNA
synthetase from Fugu rubripes, is shown in Table 1 below and shows a
strong similarity. As can be seen, the similarity between these two sequences is quite
extended and the Clustal-W score for the shown alignment equaled 462.

Similar shared regions are present in GTT1_RAT and GTT1_MOUSE (a
glutathione S-transferase 5 from Rattus norvegicus and a glutathione S-transferase theta 1
from Mus musculus, respectively), SYEP_HUMAN (a multi-functional aminoacyl
tRNA-synthetase from Homo sapiens) and GTH4_MAIZE (a glutathione S-transferase IV
from Zea mays). The Clustal-W alignments for these cases are shown in Tables 2
through 4 below. Table 2 shows a Clustal-W alignment showing a substantial similarity
between GTT1_RAT, GTT1_MOUSE and EF1G_ARTSA. The Clustal-W score is 1577.
Table 3 shows a Clustal-W alignment between a fragment from EF1G_CAEEL (a.a. 100
through 243) and a fragment from SYEP_HUMAN (a.a. 1 through 180) showing a shared
region. The Clustal-W score for this alignment is 74. Table 4 shows a Clustal-W
alignment showing a strong similarity between EF1G_RABIT and GTH4_MAIZE. The
Clustal-W score is 215.

It should be noted that a search of MEDLINE has indicated that with the
exception of the similarity between the EF1G family and the valyl-tRNA synthetase from Fugu
rubripes, none of the other similarities shown here has been reported in the literature.

In summary, the composite descriptor has correctly picked out the
members of PS50040 from the contents of Swiss-Prot and has also identified several
substantial similarities with other sequences in the database.



Table 1

EF1G_XENLA       MAGGTLYTYPDNWRAYKPLIAAQYSGFPIKVASSAPEFQFGVTNKTPEFLKKFPLGKVPA
SYV_FUGRU_piece  MA--TLYVSP HLDDFRSLLALVAAEY

EF1G_XENLA       FEGKDGFCLFESSAIAHYVGNDELRGTTRLHQAQVIQWVSFSDSHIVPPASAWVFPTLGI
SYV_FUGRU_piece  C GNAKQ QSQVWQWLSFADNELTPVSCAWFPLMGM

EF1G_XENLA       MQYNKQATEQAKEGIKTVLGVLDSHLQTRTFLVGERITLADITVTCSLLWLYKQVLEPSF
SYV_FUGRU_piece  TGLDKKIQQNSRVELMRVLKVLDQALEPRTFLVGESITLADMAVAMAVLLPFKYVLEPSD

EF1G_XENLA       RQPFGNVTRWFVTCVNQPEFRAVLGEVKLCDKMAQFDAKKFAEMQPKKETPKKEKPAKEP
SYV_FUGRU_piece  RQTLMNVTRWFTTCINQPEFLKVLGKISLCEKMVPVTAKTSTEEAAAVH-PDAAALNGPP

EF1G_XENLA       KKEKEEKKKAAPTPAPAPEDDLDESEKALAAEPKSKDPYAHLP- KSSFIMDEFKRKYSNE
SYV_FUGRU_piece  KTEAQLKKEAKKREKLEKFQQKKEMEAKKKMQPVAEKKAKPEKRELGVITYDIPTPSGEK

EF1G_XENLA       DTLTVALPYFW-EHFDKEGWSIWYAEY-KFPEELTQAFMSCNLITGMFQR-LDKLRKTGF
SYV_FUGRU_piece  KDWSPLPDSYSPQYVEAAWYPWWEKQGFFKPEFGRKSIGEQNPRGIFMMCIPPPNVTGS

EF1G_XENLA       ASVILFGTNNNSSISGVWV- FRGQDLAFTLSED WQIDYESYNWRKLDSGSEEC- -
SYV_FUGRU_piece  LHLGHALTNAIQDTLTRWHRMRGETTLWNPGCDHAGIATQVWEKKLMREKGTSRHDLGR

EF1G_XENLA       KTLVKEYFAWEGE FKNVGKPFNQG- KIFK
SYV_FUGRU_piece  EKFIEEWKWKNEKGDRIYHQLKKLGSSLDWDRACFTMDPKLSYAVQEAFIRMHDEGVIY
Table 2

GTT1_MOUSE  -VLELYLDLLSQPCRAIYIFAKKNNIPFQMHTVELRKGEHLSDAFARVNPMKKVPAMM-D
GTT1_RAT    -VLELYLDLLSQPCRAIYIFAKKNNIPFQMHTVELRKGEHLSDAFAQVNPMKKVPAMK-D
EF1G_ARTSA  VAGKLYTYPENFRAFKALIAAQYSGAKLEIAKSFVFGETMKSDAFLKSFPLGKVPAFESA

GTT1_MOUSE  GGFTLCESVAI LLYLAHK YKVPDHWYPQDLQARARV
GTT1_RAT    GGFTLCESVAI LLYLAHK YKVPDHWYPQDLQARARV
EF1G_ARTSA  DGHCIAESNAIAYYVANETLRGSSDLEKAQIIQWMTFADTEILPASCTWVFPVLGIMQFN

GTT1_MOUSE  DEYLAWQHTGLRRSCLRALWHKVMFPVFLGEQIPPETLAATLAELDVNLQVLEDKFLQDK
GTT1_RAT    DEYLAWQHTTLRRSCLRTLWHKVMFPVFLGEQIRPEMLAATLADLDWVQVLEDQFLQDK
EF1G_ARTSA  KQATARAKEDIDKALQALDDHLLTRTYLVGERITLADIWTCTLLHLYQHVLDEAFRKSY

GTT1_MOUSE  DFLVGPHISLADLVAITELMHPVGGGCPVFEGHPRLAAWYQRVEAAVGKDLFREAHEVIL
GTT1_RAT    DFLVGPHISLADWAITELMHPVGGGCPVFEGRPRLAAWYRRVEAAVGKDLFLEAHEVIL
EF1G_ARTSA  VNTNRWFITLINQKQVKAVIGDFKLCEKAGEFDP---KKYAEFQAAIGSGEKKKTEKAPK

GTT1_MOUSE  KVKDCPPADLIIKQKLMPRVLTMIQ
GTT1_RAT    KVRDCPPADPVIKQKLMPRVLTMIQ
EF1G_ARTSA  AVKAKPEKKEVPKKEQEEPADAAEEALAAEPKSKDPFDEMPKGTFNMDDFKRFYSNNEET

GTT1_MOUSE
GTT1_RAT
EF1G_ARTSA  KSIPYFWEKFDKENYSIWYSEYKYQDELAKVYMSCNLITGMFQRIEKMRKQAFASVCVFG

GTT1_MOUSE
GTT1_RAT
EF1G_ARTSA  EDNDSSISGIWVWRGQDLAFKLSPDWQIDYESYDWKKLDPDAQETKDLVTQYFTWTGTDK

GTT1_MOUSE
GTT1_RAT
EF1G_ARTSA  QGRKFNQGKIFK



Table 3

EF1G_CAEEL (100-243)  NFD KKTVEQYK- -NELNGQLQVLDRVLVKKTYLVGERLSLADVSVALDLLPAF
SYEP_HUMAN (1-180)    MEHTEIDHWLEFSATKLSSCDSFTSTINELNHCLSLRTYLVGNSLSLADLCWATLKGNA

EF1G_CAEEL (100-243)  QYVLDANARKSIVNVTRWFRTWNQPAVKEV- -LGEVSLASS-VA-QFNQ- -AKFTELS-
SYEP_HUMAN (1-180)    AWQEQLKQKKAPVHVKRWFGFLEAQQAFQSVGTKWDVSTTKARVAPEKKQDVGKFVELPG

EF1G_CAEEL (100-243)  AKVAKSAPKAEKPKKEAKPAAAA- - AQP E DD-EPKEEKS-KDP- -
SYEP_HUMAN (1-180)    AEMGKVTVRFPPEASGYLHIGHAKAALLNQHYQVNFKGKLIMRFDDTNPEKEKEDFEKVI
Table 4

EF1G_RABIT  MAAGTLYTYPEITWRAFKALIAAQYSGAQVRVLSAPPHFHFGQTNRTPEFLRKFPAGKVPA
GTH4_MAIZE  -ATPAVKVYGWAISPFVSRALLALEEAGVDYELVPMSRQDGD-HRRPEHLARNPFGKVPV

EF1G_RABIT  FEGDDGFCVFESNAIAYYVS NEELRGSTPEAAAQWQWVSFADSDIVPPAST
GTH4_MAIZE  LE-DGDLTLFESRAIARHVLRKHKPELLGGGRLEQTAMVDVWLEVEAHQLSPPAIAIWE

EF1G_RABIT  WVFPTLGIMHHNKQATENAKEEVKRILGLLDAHLKTRTFLVGERVTLADITWCTLLWLY
GTH4_MAIZE  CVFAPFLGRERNQAWDENVEKLKKVLEVYEARLATCTYLAGDFLSLADLSPF-TIMHCL

EF1G_RABIT  KQVLEPSFRQAFPMTNRWFLTCINQPQFRAVLGEVKLCEKMAQFDAKKFAESQPKKDTPR
GTH4_MAIZE  MATEYAALVHALPHVSAWWQGLAARP AAN KVAQF- -MPVGAGAPKEQE- -



The Second Example: G protein-coupled receptors 

The family of G protein-coupled receptors has a long evolutionary history
and is of particular importance for signal transduction in all eukaryotes. Spanning the
lipid bilayer of the plasma membrane with seven helices, they bind and form signal
transducing couples that are at the center of many key processes such as visual excitation,
olfaction, histamine secretion in allergic reactions, and chemotaxis. G protein-coupled
receptors form a very diverse family, and extensive studies have shown that single
descriptor approaches do not suffice to characterize the family's members.

Despite considerable efforts, very few membrane proteins have yielded
high-resolution X-ray crystallographic data; this led to increased use of electron
microscope approaches. The first such data were in fact obtained for bacteriorhodopsin,
the bacterial analogue of rhodopsin, where a 3-D electron-microscopy reconstruction
established directly the presence of the seven transmembrane helices. The significant
sequence similarity that the members of this family exhibit indicates that they ought to
have the same topology.

In order to demonstrate the power of the present invention and its ability to
generalize, the experiment began with the contents of the GPCRDB as they existed in
May 1998. Note that the hypothetical proteins from Caenorhabditis elegans were excluded
from this collection, since it was intended to carry out GPCR discovery in this genome.
The bacterial analogues of rhodopsin as well as all listed G-proteins were also excluded.
What was left was a total of 1,019 GPCR entries, of which 862 were complete sequences
and 157 were fragments. This set was intersected with an older release of Swiss-Prot
(Release 35.0 from November 1997), and it was determined that the intersection of the two
databases comprised a total of 804 sequences and fragments. Starting with data that were
almost two years old was intentional, since it was important to show the ability of the
composite descriptors to generalize and identify additional candidate sequences in the
much larger databases of today.

The collection of 804 GPCR sequences and fragments contained several
classes (e.g. rhodopsin-like, secretin-like, pheromone, etc.) of proteins. In turn, each of
these classes comprised several representatives. Instead of selecting representatives from
each of the identified classes, the order of the sequences in this set of 804 members was
randomized. Note that the contents of the sequences themselves remained unchanged;
only their order of appearance was modified. For example, the 613-th sequence was now
listed 4-th, the 11-th sequence now appeared in the 45-th position, and so on.
Subsequently, a training set T was formed by collecting the sequences and fragments
listed in the first 80 positions, arguably a very small set if one considers the diversity of
the GPCR family. Essentially, slightly less than 1/10-th of the available dataset was
randomly sub-selected for the purposes of building the composite descriptor. Table 5
below contains a listing of the labels of the 80 sequences in this training set. Table 5
shows the Swiss-Prot labels of the 80 sequences in the training set for the G
protein-coupled receptor experiment. The labels are listed in the order they were selected
and they correspond to both sequences and sequence fragments.
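
The random sub-selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_training_set`, the fixed `seed`, and the placeholder labels `SEQ001` ... `SEQ804` are hypothetical and do not come from the experiment itself.

```python
import random


def select_training_set(labels, size=80, seed=0):
    """Shuffle a copy of the labels and keep the first `size` entries.

    Only the order of appearance is changed; the input list and the
    sequences it refers to are left untouched.
    """
    shuffled = list(labels)                 # work on a copy
    random.Random(seed).shuffle(shuffled)   # seed is an assumption, for repeatability
    return shuffled[:size]


# Placeholder labels standing in for the 804 sequences and fragments.
labels = [f"SEQ{i:03d}" for i in range(1, 805)]
training = select_training_set(labels)
```

With 804 entries and a training set of 80, the selected fraction is 80/804, i.e. slightly less than one tenth, as stated in the text.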

Table 5

| 1 through 20 | 21 through 40 | 41 through 60 | 61 through 80 |



EBI2_HUMAN
OPSD_CORAU
ACM2_HUMAN
PACR_RAT
ML1X_HUMAN
ACM4_XENLA
VIPR_MELGA
CRFR_CHICK
ACM3_CHICK
G10D_MOUSE
V1BR_HUMAN
OLF9_RAT
P2YR_MOUSE
OLF4_CHICK
MGR8_HUMAN
ACM4_MOUSE
OAR_DROME
ACM3_PIG
SSR4_RAT
NY4R_MOUSE
AA1R_CHICK
5H1A_MOUSE
HH1R_MOUSE
5HTB_DROME
MAM2_SCHPO
MSHR_BOVIN
NK2R_RAT
GU38_RAT
SCRC_RAT
OLF5_RAT
MSHR_HUMAN
PF2R_BOVIN
PAFR_CAVPO
GU03_RAT
A2AA_PIG
OPSB_GECGE
ACM3_RAT
P2YR_BOVIN
B3AR_BOVIN
AA3R_HUMAN
m 1 ,T PTTMATJ
GPCR_LYMST
OPSB_HUMAN
MC3R_MOUSE
GPRJ_MOUSE
FMLR_RABIT
GPRO_HUMAN
BAR2_SCHCO
D4DR_MOUSE
B1AR_HUMAN
5H2A_CRIGR
CRFR_HUMAN
ML1C_CHICK
D3DR_RAT
PER4_MOUSE
MC4R_RAT
PER2_RAT
PF2R_MOUSE
OPSD_CATBO
OPSB_ANOCA
OPRX_PIG
PER1_RAT
ACM1_RAT
IL8A_RAT
AA1R_HUMAN
GRPR_MOUSE
OPS2_SCHGR
AA1R_CAVPO
DOP1_DROME
GRFR_PIG
GRHR_HUMAN
AG2R_HUMAN
OXYR_RAT
NK1R_RANCA
NK1R_RAT
GPRM_HUMAN
B3AR_MOUSE
OLF1_HUMAN
EDG2_SHEEP
CASR_HUMAN





As in the previous example, the patterns were discovered assuming the
equivalence classes {A, G}, C, {D, E}, {F, Y}, H, {I, L, M, V}, {K, R}, {N, Q}, P, {S,
T}, W. The Teiresias parameters were set to L=5 and W=10, whereas the successive
threshold choices were K=80, K=16 and K=3. The aim was to discover patterns that
involve at least 5 non-wild cards in any rolling window that spans 10 positions and
begins and ends with a literal, which is a relatively high degree of local similarity (i.e., 50%
or higher). Those patterns whose estimated log-probability was equal to -30.0 or less
were selected, and this generated a composite descriptor that comprised 1,703 patterns.
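
The equivalence classes and the L=5, W=10 density requirement can be illustrated with a short sketch. It is not the Teiresias implementation. It assumes that a pattern is written as a string in which `.` marks a wild card (a notational assumption), that each residue is represented by the first member of its class (an arbitrary choice), and that the density requirement means every run of L consecutive literals spans at most W positions. The names `reduce_alphabet` and `satisfies_density` are hypothetical.

```python
# Equivalence classes as listed in the text; single residues form their own class.
EQUIVALENCE_CLASSES = ["AG", "C", "DE", "FY", "H", "ILMV", "KR", "NQ", "P", "ST", "W"]

# Map every residue to the first member of its class (arbitrary representative).
REPRESENTATIVE = {aa: cls[0] for cls in EQUIVALENCE_CLASSES for aa in cls}


def reduce_alphabet(sequence):
    """Rewrite a protein sequence over the reduced, class-based alphabet."""
    return "".join(REPRESENTATIVE[aa] for aa in sequence)


def satisfies_density(pattern, L=5, W=10, wildcard="."):
    """Check that any L consecutive literals fit in a window of W positions.

    The pattern must begin and end with a literal and contain at least
    L literals in total.
    """
    literals = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(literals) < L or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    return all(
        literals[i + L - 1] - literals[i] + 1 <= W
        for i in range(len(literals) - L + 1)
    )
```

For instance, under these assumptions a pattern such as `A.I.D.K.S` (5 literals within 9 positions) passes the check, while `A..I..D..K..S` (5 literals spread over 13 positions) does not.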

First, the corresponding DFA (deterministic finite automaton, which will
only recognize instances of the composite descriptor patterns in a query sequence and
which performs method step 260) was used to search a randomized version,
RAND-Swiss-Prot, of Swiss-Prot (Release 38.0) (see also the relevant discussion in the
PS50040 example). The sequence regions with non-zero local counters were identified
and the maximum counter values from each such region were summed up; the sum total
was attached to the sequence label and the sequences were sorted in order of decreasing
sum value. A total of 1,564 sequence fragments from RAND-Swiss-Prot received
non-zero support and the actual histogram of these values is shown in Fig. 3. Of those
1,564 fragments, 1,548 received a support value that was less than 9. Thus Thres_rand=10
was selected; this threshold choice corresponded to the 99-th percentile.
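
The scoring step described above (take the maximum local counter in each non-zero region, then sum these maxima) can be sketched as follows. The representation of the local counters as a plain list of integers, one per sequence position, is an assumption made for illustration, as are the names `sequence_support` and `fraction_below`.

```python
from itertools import groupby


def sequence_support(local_counters):
    """Sum the maximum counter value of every contiguous non-zero region."""
    total = 0
    for is_nonzero, run in groupby(local_counters, key=lambda c: c > 0):
        if is_nonzero:
            total += max(run)
    return total


def fraction_below(scores, threshold):
    """Fraction of scores that fall strictly below a candidate threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


# Toy counters: two non-zero regions with maxima 5 and 3.
example_support = sequence_support([0, 2, 5, 1, 0, 0, 3, 3, 0])
```

For the figures quoted in the text, 1,548 of 1,564 randomized fragments is a fraction of roughly 0.990, which is consistent with the stated 99-th percentile.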

Subsequently, the same DFA was used to search the actual Swiss-Prot
database, testing each of its 80,236 sequences for membership in the G protein-coupled
receptor family. Sum values were attached to each sequence as above, and only the 947
sequences from Swiss-Prot that received support greater than or equal to Thres_rand=10
were kept.

In order to determine the quality of the composite descriptor and
the number of true and false positives that the descriptor gives rise to, the Swiss-Prot
annotation (keyword "KW" lines) was used for each of these 947 sequences. Of these
retrieved sequences, 928 are actually listed as 'G protein-coupled receptor's, 10 are
eukaryotic transmembrane proteins (SUR7_YEAST, C561_HUMAN, YIPC_YEAST,
NU4M_APIME, SCG2_XENLA, GTR2_LEIDO, GARP_HUMAN, CIN6_HUMAN,
CIN3_RAT, PLSC_COCNU), 2 are hypothetical eukaryotic transmembrane proteins
(YJZ3_YEAST, YMJC_CAEEL), 2 are hypothetical proteins (YKY4_YEAST,
YCX7_YEAST), and finally 5 are bacterial false positives (PIP_BACCO,
VIRR_AGRT6, YQGP_BACSU, HBD_CLOTS, PROA_HAEIN).
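
As a quick arithmetic check of the breakdown above, the category counts can be tallied. The counts are taken from the text; the dictionary keys are shorthand introduced here for illustration only.

```python
# Category counts as reported for the 947 retrieved Swiss-Prot sequences.
counts = {
    "annotated G protein-coupled receptor": 928,
    "eukaryotic transmembrane protein": 10,
    "hypothetical eukaryotic transmembrane protein": 2,
    "hypothetical protein": 2,
    "bacterial false positive": 5,
}

total = sum(counts.values())
gpcr_fraction = counts["annotated G protein-coupled receptor"] / total
```

The five categories add up to the 947 retrieved sequences, and about 98% of them carry the G protein-coupled receptor annotation.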

This is a very notable result, given the comparatively small amount of
information that is captured by the 80-sequence input set and the diversity of the G
protein-coupled receptor family. Table 6 below contains a listing of the labels of the 947
Swiss-Prot sequences whose support exceeded threshold; the labels are listed in order of
decreasing value of the global counter that was associated with the corresponding
sequence, and the 5 false positives are shown in boldface. Table 6 shows the labels of the
947 sequences from Swiss-Prot Release 38.0 that received support above threshold in the
G protein-coupled receptor example. The 5 false positives are shown with an "(FP)."



Table 6. 



| 1 through 50 | 51 through 100 | 101 through 150 | 151 through 200 | 201 through 250 |






OAR DROME 


B2AR CANFA 


A2AC DIDMA 


5H1A RAT 


MSHR CEREL 


B3AR MOUSE 


AA1R RAT 


ACM4 XENLA 


5H1A MOUSE 


MSHR CAPHI 


B3AR CAVPO 


AA1R RABIT 


ACM1 MOUSE 


5H1A HUMAN 


MSHR CAPCA 


B3AR RAT 


AA1R HUMAN 


AA3R SHEEP 


SCRC RABIT 


MSHR ALCAA 


OAR HELVI 


AA1R CANFA 


AA2B CHICK 


A1AA CANFA 


VI PR CARAU 


OAR BOMMO 


AA1R CHICK 


ACM4 MOUSE 


MC5R HUMAN 


VI PR HUMAN 


OAR2 LOCMI 


B1AR XENLA 


ACM4 HUMAN 


AA3R CANFA 


HH1R BOVIN 


OAR1 LOCMI 


NK1R RANCA 


5H4 CAVPO 


5H4 RAT 


5HT LYMST 


GREC BALAM 


B1AR SHEEP 


AA2B HUMAN 


NK2R CAVPO 


HH1R RAT 


B1AR RAT 


AA1R BOVIN 


NK1R RAT 


SSR1 RAT 


HH1R HUMAN 


B1AR MOUSE 


A2AA CAVPO 


NK1R MOUSE 


SSR1 MOUSE 


HH1R CAVPO 


B1AR PIG 


A2AA HUMAN 


MC5R RAT 


SSR1 HUMAN 


SSR3 HUMAN 


B1AR MACMU 


5H2B HUMAN 


MC5R MOUSE 


NK4R HUMAN 


5HT BOMMO 


B1AR HUMAN 


AA1R CAVPO 


MC5R BOVIN 


OPRK RAT 


NK2R HUMAN 


B1AR CANFA 


ACM3 PIG 


MC5R SHEEP 


OPRK MOUSE 


NK3R RAT 


B1AR MELGA 


ACM3 HUMAN 


AA3R RAT 


OPRK HUMAN 


NK3R MOUSE 


B4AR MELGA 


ACM3 CHICK 


DOP1 DROME 


OPRK CAVPO 


NK3R HUMAN 


B3AR BOVIN 


ACM 3 BOVIN 


AA2B RAT 


5HTA DROME 


SSR3 RAT 


B3AR CANFA 


A2AC HUMAN 


AA2BJYIOUSE 


AA3R RABIT 


SSR3 MOUSE 


PACR RAT 


5H7 RAT 


A1AB RAT 


A2AB RABIT 


MSHR MOUSE 


B3AR MACMU 


5H7 MOUSE 


A1AB MOUSE 


A2AB MAC PR 


GRFR PIG 


B3AR HUMAN 


5H7 HUMAN 


A1AB MESAU 


MC3R MOUSE 


NK2R RABIT 


B2AR MOUSE 


5H7 CAVPO 


A1AB HUMAN 


MC3R HUMAN 


NK2R BOVIN 


SCRC RAT 


A2AD HUMAN 


D3DR HUMAN 


5H1B FUGRU 


VIPR PIG 


B2AR RAT 


A2AC RAT 


D3DR CERAE 


ACM4 RAT 


5H1A FUGRU 


B2AR MESAU 


A2AC MOUSE 


NK1R HUMAN 


A2AB TALEU 


5HTB DROME 


B2AR BOVIN 


A2AC CAVPO 


NK1R CAVPO 


A2AB PROHA 


VI PS HUMAN 


B2AR PIG 


VI PR RAT 


SSR4 RAT 


A2AB ORYAF 


MC4R RAT 


5H2A PIG 


A2AA RAT 


SSR4 MOUSE 


A2AB HORSE 


MC4R HUMAN 


5H2A MACMU 


A2AA PIG 


SSR4 HUMAN 


A2AB ERIEU 


A1AB CANFA 


5H2A HUMAN 


A2AA MOUSE 


D2D1 XENLA 


A2AB ELEMA 


NK2R RAT 


5H2A CRIGR 


ACM3 RAT 


AA3R HUMAN 


A2AB DUGDU 


NK2R MOUSE 


5H2A RAT 


A2AR LABOS 


5H7 XENLA 


A1AD HUMAN 


NK2R MESAU 


5H2A MOUSE 


ACM4 CHICK 


D3DR RAT 


A2AB DIDMA 


D2D2 XENLA 


PACR MOUSE 


AA2A HUMAN 


D3DR MOUSE 


A1AD RAT 


MSHR HUMAN 


5H2C RAT 


AA2A CANFA 


A1AA ORYLA 


A1AD RABIT 


5H1F RAT 


5H2C HUMAN 


AA2A RAT 


A2AB CAVPO 


A1AD MOUSE 


5H1F MOUSE 


5H2C MOUSE 


AA2A MOUSE 


D2DR MOUSE 


OPRM RAT 


5H1F CAVPO 


PACR BOVIN 


A1AA RAT 


D2DR HUMAN 


OPRM PIG 


5H1F HUMAN 


5H2B MOUSE 


A1AA RABIT 


D2DR FUGRU 


OPRM MOUSE 


DOP2 DROME 


PACR HUMAN 


A1AA HUMAN 


D2DR CERAE 


OPRM HUMAN 


HH2R CAVPO 
D4DR RAT


A1AA BOVIN 


D2DR BOVIN 


OPRM BOVIN 


HH2R CANFA 


D4DR MOUSE 


ACM 2 RAT 


A2AB RAT 


MC3R RAT 


HH1R MOUSE 


SCRC HUMAN 


ACM2 PIG 


A2AB MOUSE 


AA2A CAVPO 


GASR PRANA 


D4DR HUMAN 


ACM2 HUMAN 


A2AB HUMAN 


MSHR BOVIN 


GASR HUMAN 


VI PR MELGA 


ACM2 CHICK 


5H4 MOUSE 


MSHR VULVU 


A2AB BOVIN 


5H2B RAT 


ACM5 RAT 


ACM1 RAT 


MSHR SHEEP 


5H1B RABIT 


A2AR CARAU 


ACM 5 MACMU 


ACM1 PIG 


MSHR RANTA 


MSHR HORSE 


B2AR MACMU 


ACM5 HUMAN 


ACM1 MACMU 


MSHR OVIMO 


GASR RABIT 


B2AR HUMAN 


5HT1 DROME 


ACM1 HUMAN 


MSHR DAMDA 


GASR MOUSE 






| 251 through 300 | 301 through 350 | 351 through 400 | 401 through 450 | 451 through 500 |






GASR CANFA 


IL8B GORGO 


GPRL HUMAN 


HSjtZ 1 Kill 


BRS4 BOMOR 


GASR BOVIN 


IL8B BOVIN 


MGR8 HUMAN 


twit i 7\ cij'nr'C'D 


V1BR RAT 


5H1D RAT 


5H5A RAT 


GPCR LYMS 1 


MT 1 A PT-TOQT7 


rdo-j SHEEP 


5H1D MOUSE 


5H5A MOUSE 


A2AB AMBHO 


TVTT 1 A UT TM S. "NT 


RRS3 MOUSE 


5H1D FUGRU 


5H6 HUMAN 


BRB2 HUMAN 


h/t/^ti o o TV T 1 


Rpq-i HUMAN 


5H1D CAVPO 


VI PS RAT 


OLF5 RAT 


XT OA /^i^iTl c\ 

ILioA LjUKLjU 


\JlTiD yJ UIl J- v^Xn. 


5H1B SPAEH 


VI PS MOUSE 


r~TT1 "n T TT TftVT 7\ "KT 

5H1E HUMAN 




Mf4R R MOUSE 


5H1B RAT 


VI PR MOUSE 


5H1D RABIT 


iTl"D DAT 


OPSB ANOCA 


5H1B MOUSE 


CCKR RAT 


OP SI PATYE 


nnDP ivtihttctt 
CjjFKU iYlUUoii 


ni-TClP "RAT 


GASR RAT 


CCKR HUMAN 


CKR1 MACMU 


LiirKC HUMAN 


P.UCD PIG 


YYI3 CAE EL 


CCKR CAVPO 


CKR1 HUMAN 


*-i -n"D "3 MOTTO T? 


rirroTj WTTMAN 

vjllijA. nuj imn 


MSHR CHICK 


IL8B PANTR 


BRB2 MOUSE 


i—i —\ ttt TtJt A "NT 

CjPKJ human 


TTMT.1 PANTR 


5H1B CRIGR 


IL8B MACMU 


DADR XENLA 






5H1B CAVPO 


IL8B HUMAN 


DADR PIG 


EDG2 MUUbb 


paQd RAT 


SSR2 RAT 


IL8A PANTR 


DADR HUMAN 


nr\j^lO T TT 1T\ iT 7V "KT 

EDG2 HUMAN 


C.WO APT, PA 


SSR2 MOUSE 


IL8A HUMAN 


DADR DIDMA 


riT 1 T^JT/^T TO TP 

BLR1 MOUSE 


nDCTl &T.T.MT 
U ±r O JJ LiVi ± 


B3AR PIG 


AG2R MELGA 


D1DR CARAU 


OLF4 RAT 


HH"D A PAT 1 


SSRz HUMAN 


7\POP PUTPW 
K. \_xlJ. LIv 


RHTl APLCA 

Jill -L xj*4- -i— 1 a. 


GRPR RAT 


fifp/i PAPATCT 


5H1B HUMAN 


OLF0 RAT 


5H2A CANFA 


GRPR MOUSE 


CCR4 MOUSE 


OPRX RAT 


GPRF HUMAN 


OLF9 RAT 


GALS HUMAN 


CCR4 MACMU 


OPRX MOUSE 


CCKR XENLA 


OLF1 RAT 


FMLR MACMU 


CCR4 MAC FA 


OPRX CAVPO 


AG2S RAT 


FML1 PONPY 


CKR3 MACMU 


CCR4 HUMAN 


HH2R HUMAN 


AG2S MOUSE 


DCDR XENLA 


AG2R RABIT 


CCR4 FELCA 


5H1D HUMAN 


AG2S HUMAN 


DADR RAT 


EDG2 BOVIN 


CCR4 CERTO 


5H1D CANFA 


AG2R RAT 


FML1 MACMU 


GALS RAT 


CCR4 BOVIN 


HH2R RAT 


AG2R PIG 


FML1 HUMAN 


GALS MOUSE 


APJ HUMAN 


OPRX PIG 


AG2R MOUSE 


FML1 GORGO 


G10D RAT 


OPSD RANTE 


OPRX HUMAN 


AG2R MERUN 


GPR1 RAT 


G10D MOUSE 


5H1B PIG 


SSR2 PIG 


AG2R HUMAN 


BRB2 RAT 


OPSD RAT 


CKR8 MOUSE 


SSR2 BOVIN 


AG2R CANFA 


GALR RAT 


CKR3 MOUSE 


CKR1 MOUSE 


TLR2 DROME 


AG2R BOVIN 


GALR MOUSE 


VI BR HUMAN 


OPSP CHICK 


HH2R MOUSE 


5H6 RAT 


GALR HUMAN 


FML1 MOUSE 


OPSD OCTDO 


5H1B DIDMA 


GPRF MACNE 


D5DR FUGRU 


OPSD CRIGR 


OPRM CAVPO 


A2AB ECHTE 


GPRF CERAE 


D1DR OREMO 


BRS3 CAVPO 


OPSX MOUSE 


5HT HELVI 


GP3 8 HUMAN 


FMLR MOUSE 


ML1A CHICK 


OPSX HUMAN 


5H1D PIG 


5H2A CAVPO 


BRB2 RABIT 


TLR1 DROME 


C3AR HUMAN 


OPRD RAT 


CCKR MOUSE 


SSR5 HUMAN 


OPSD TRIMA 


OPSD RAJER 


OPRD MOUSE 


YDBM CAEEL 


DBDR RAT 


OPSD SHEEP 


ML IX HUMAN 


OPRD HUMAN 


NYR DROME 


DBDR HUMAN 


OPSD RABIT 


AG2S XENLA 


GALT RAT 


OLFD CANFA 


5H2B PIG 


OPSD PIG OLF6 RAT 
GALT MOUSE


GRFR MOUSE 


ML1A MOUSE 


OPSD PHOVI 


ML IB HUMAN 


5H5A HUMAN 


GRFR HUMAN 


MC4R MOUSE 


OPSD PHOGR 


OLFJ HUMAN 


GPRA HUMAN 


SSR5 RAT 


MAM2 SCHPO 


OPSD PETMA 


ML1B CHICK 


IL8B RAT 


SSR5 MOUSE 


FMLR RABIT 


OPSD MOUSE 


GPRJ HUMAN 


IL8B MOUSE 


OLFE HUMAN 


D1DR FUGRU 


OPSD MAC FA 


GPR4 PIG 


IL8A RABIT 


OPRD PIG 


IL8A RAT 


OPSD HUMAN 


GPR4 HUMAN 


GRFR RAT 


IL8B RABIT 


BLR1 RAT 


OPSD CANFA 


CKR8 HUMAN 


5H5B RAT 


OLF4 CANFA 


DBDR XENLA 


OPSD BOVIN 


GPRJ MOUSE 


5H5B MOUSE 


GU2 7 RAT 


CASR HUMAN 


GALT HUMAN 


GPR8 HUMAN 


GPRA RAT 


OLFI HUMAN 


BLR1 HUMAN 


OAR2 LYMST 


OPSD XENLA 



| 501 through 550 | 551 through 600 | 601 through 650 | 651 through 700 | 701 through 750 |



OPSD TURTR 


NY2R PIG 


OPS I ASTFA 


GC96 HUMAN 


OPSP P.hIMA 


OPSD RANPI 


NY2R HUMAN 


OPSG ORYLA 


YTJ5 CAE EL 


OPSD ZiiUr'A 


OPSD RANCA 


NY2R BOVIN 


OPSG CARAU 


OLF8 RAT 


OPSD COTBO 


OPSD MESBI 


NY2R MOUSE 


OPSD GAMAF 


NY1R XENLA 


OPSD ABYKU 


OPSD GLOME 


CKR6 HUMAN 


FML2 MACMU 


GPR5 HUMAN 


OLF2 CHICK 


OPSD DELDE 


C3AR MOUSE 


C5AR CANFA 


GIPR RAT 




OPSD BUFMA 


VQ3L CAPVK 


TRFR CHICK 


C5AR RAT 


OLF5 CHICK 


OPSD BUFBU 


OXYR PIG 


THRR RAT 


BONZ HUMAN 


AT T—l -1 /"IT TTHT/ 

OLF3 CHICK 


OPSD AMBTI 


OXYR MOUSE 


THRR PAP HA 


YR42 CAEEL 


OLFI CHICK 


ML1C CHICK 


OXYR MACMU 


THRR MOUSE 


OPSD NEOAU 


FSHR PIG 


CASR BOVIN 


OXYR BOVIN 


THRR HUMAN 


CKR4 HUMAN 


FSHR MACFA 


TRFR SHEEP 


EDG1 RAT 


THRR CRILO 


V1AR HUMAN 


FSHR HUMAN 


TRFR RAT 


EDG1 MOUSE 


CCR3 HUMAN 


OPSD SARTI 


FSHR HORSE 


TRFR MOUSE 


EDG1 HUMAN 


GPRO RAT 


OPSD SARSP 


FSHR EQUAS 


TRFR HUMAN 


ACTR HUMAN 


THRR XENLA 


BONZ MACNE 


DBDR BOVIN 


OXYR HUMAN 


OPSD SARPU 


PTRR DIDMA 


BONZ CERAE 


AG22 SHEEP 


OPSD LAMJA 


DADR RABIT 


NTR1 HUMAN 


RDC1 HUMAN 


US 2 8 HCMVA 


OPSD CHICK 


C5AR GORGO 


rTTi T"» nTH 

5H1E PICj 


nrnnTj DTP 


PTRT? RAT 


ML1C XENLA 


YLD1 CAE EL 


OLF3 CANFA 


PTRR HUMAN 


PTRR MOUSE 


GPRX ORYLA 


FMLR PONPY 


GPRJ RAT 


YR13 CAEEL 


OPSB API ME 


OPS1 SCHGR 


OPSB CONCO 


GPRD HUMAN 


OPSD NEOSA 


OLF7 RAT 


OLF2 RAT 


ML IX MOUSE 


GIPR HUMAN 


OPSD COMDY 


OL1C HUMAN 


ML1X SHEEP 


FSHR SHEEP 


OPSF ANGAN 


OPSD CATBO 


NY6R MOUSE 


CCR4 SHEEP 


FSHR BOVIN 


OPSD TAUBU 


OLF6 MOUSE 


ET1R RAT 


PE22 RAT 


ACTR BOVIN 


OPSD BATNI 


OPSD CAMAB 


ET1R PIG 


OPSP COLLI 


PE22 MOUSE 


OPSD BATMU 


OLFI HUMAN 


ET1R HUMAN 


GPR6 RAT 


PE22 HUMAN 


P2Y9 HUMAN 


OLFI CANFA 


ET1R BOVIN 


GPR6 HUMAN 


OX2 R RAT 


P2Y5 HUMAN 


ETBR RAT 


OPSB CHICK 


ACM1 DROME 


OX2R HUMAN 


P2Y5 CHICK 


ETBR PIG 


OLF2 HUMAN 


V1AR MOUSE 


5H1B CANFA 


OXYR SHEEP 


ETBR MOUSE 


OPSD LIMPA 


FMLR PANTR 


OGR1 HUMAN 


OPSD NEOAR 


ETBR HUMAN 


OPSD CYPCA 


FMLR HUMAN 


GPRV HUMAN 


NMBR RAT 


ETBR HORSE 


OPSD CARAU 


RDC1 CANFA 


ACTR MOUSE 


NMBR MOUSE 


ETBR COTJA 


OL15 MOUSE 


OPSD ANOCA 


ACTR MESAU 


TDA8 MOUSE 


ETBR CANFA 


GU5 8 RAT 


GPRK HUMAN 


FMLR GORGO 


H218 RAT 


ETBR BOVIN 


GU3 8 RAT 


FML2 PONPY 


OPSB GECGE 


GPRH HUMAN 


OPS2 LIMPO 


GU01 RAT 


FML2 PANTR 


RDC1 MOUSE 


CKR7 MOUSE 


OPS1 LIMPO 


OPSD PARKN 


FML2 HUMAN 


OXYR RAT 


CKR7 HUMAN 


OL7B MOUSE 


OPSD COTIN 


FML2 GORGO 


OPSU BRARE 


AG22 RAT 


OLID HUMAN 


NY6R RABIT 


EBI2 HUMAN 


OPSD SARXA 


AG22 MOUSE 


OL1A HUMAN 


OPSD ICTPU 
C5AR PONPY


OPSD SARMI 


AG22 HUMAN 


GU03 RAT 


GPRD RAT 


C5AR PANTR 


OPSD SARDI 


0LF3 HUMAN 


GRHR CLAGA 


GI PRIMES AU 


C5AR HUMAN 


OPSD MYRBE 


CKR4 MOUSE 


CKR3_HUMAN 


OPSB AST FA 


VI AR SHEEP 


NTR1 RAT 


OPSD ANGAN 


CKR3 CERAE 


GU45 RAT 


V1AR RAT 


GPRO HUMAN 


NMBR HUMAN 


C5AR MACMU 


DEZ HUMAN 


OPSP ICTPU 


OPSH CARAU 


OPSD TOD PA 


YWOl CAEEL 


OL13 MOUSE 


NY2R CAVPO 


OPSD POERE 


OPSD SEPOF 


OL1G HUMAN 


GPR7 HUMAN 


GPR1 HUMAN 


OPSD ORYLA 


OPSD PROJE 


YT6 6 CAEEL 


OX1R RAT 


AG2R XENLA 


OPSD MYRVI 


OPSD LOLFO 


OLF4 CHICK 


OX1R HUMAN 


OLF3 RAT 


C5AR MOUSE 


OPSD COTGR 


HM74 HUMAN 


BRB1 RABIT 



| 751 through 800 | 801 through 850 | 851 through 900 | 901 through 947 |














YPHD ECOLI 


Ub<s / riL-lYlVA 


rip tip MOTTLE 


P2YR MOUSE 




OPSD SPHSP 


~K1/~\T\/~ I DTJT CM 

JSJ UJJ U KH J. b l v l 


rtpnP WTTMAN 


P2YR HUMAN 




OPSD LOLSU 


PTjyiO tJT TM S \T 
\jt"± <L rlUl v Lf-iiM 


(TRHP HOP°>E 


P2YR BOVIN 




GPRP HUMAN 


rin/ -i UTTMSKT 

b-F41 HUMAJN 


apuD POVTN 

VJXtriJA. DU V J. IN 


OPSB ORYLA 




CKRV MOUSE 


MGR4 KA1 


nDDO HiTMAN 
KjcrL^ nUrini'l 


NU4M API ME 




CKR5 RAT 


T OTTO CUPUD 

LbHK bHiibP 


r*T.PP MOTTLE 


ML1A BOVIN 




CKR5 MOUSE 


LSHR FILr 




YCX7 YEAST 




BAR2 SCHCO 


T CUD TJT ~XIJ[ 1\ \T 

LSHR HUMAN 


T7QTTP PAT 


SCG2 XENLA 




PI2R HUMAN 


T C T IT) PAT TA 

LSHR LALJA 


VTfVA VEAQT 


PIP BACCO (FP) 




OPSD COTKE 


riT -ri T\Ai~\T TOT? 

GLR MUUbb 


UA/ 1\. it ±\J 


PI2R RAT 




VK02 SPVKA 


EDCjL lYlUUbli 


nnqy puTPtf 
Ur d V Lnl ^_.iy 


PI2R BOVIN 




DEZ RAT 


UPbU PUIYIMJ. 




PAFR RAT 




DEZ MOUSE 


Tl/T/"^ T~> *~7 "□AT 1 

MGR / KA 1 


npcn pat 


P2UR RAT 




CKR5 PAPHA 


Ti/T/^TD 1 TJT TM A TvT 


nPPP, MOTTLE 


P2UR MOUSE 


— 


CKR5 PANTR 


CML2 RA1 




P2UR HUMAN 




CKR5 MACMU 


JS1Y1R KA.1 


ApCA TlPOPQ 


OPSR ORYLA 




CKR5 GORGO 


NY1R PIG 


1 T TJTTMAAT 


OP°>0 ^ALSA 




CKR5 CERTO 


>TTT -1 T*l TV JT/^T TC^ T7 1 

NY1R MOUbh 


Ui-i-L-B llUl v Lrl±\l 


(TTP? LET DO 




CKR5 CERAE 


NY1R HUMAN 




m .HP ANT EL 




CECP2 MOUSE 


NY1R CANFA 


GRHR RAT 


GARP HUMAN 




CKR2 HUMAN 


RTA RAT 


EDG3 HUMAN 


PAFR MOUSE 




OPSB CARAU 


OPSV XENLA 


C561 HUMAN 


OPSD API ME 




OPS 2 PATYE 


OPSU CARAU 


YIPC YEAST 


MAS RAT 




GP43 HUMAN 


OPS1 DROPS 


PTH2 RAT 


MAS MOUSE 




GCRC MOUSE 


OPS1 DROME 


PAFR CAVPO 


MAS HUMAN 




BRB1 HUMAN 


OPS1 CALVI 


OPSB BOVIN 


CIN6 HUMAN 




VC03 SPVKA 


MGR6 RAT 


OPS2 SCHGR 


CIN3 RAT 




PE24 RAT 


GLPR RAT 


OLF6 CHICK 


CB1R RAT 




PE24 RABIT 


ETBR MAC FA 


NY4R RAT 


CB1R MOUSE 




PE24 MOUSE 


TSHR SHEEP 


NY4R MOUSE 


CB1R HUMAN 




PE24 HUMAN 


TSHR BOVIN 


GRHR PIG 


CB1R FELCA 




YYOl CAEEL 


OPSR HORSE 


GPRM HUMAN 


CB1B FUGRU 




YR41 CAEEL 


OPSG ODOVI 


GP3 9 HUMAN 


CB1A FUGRU 




V2R RAT 


LSHR RAT 


PTR2 HUMAN 


YQGP BACSU(FP) 




V2R PIG 


LSHR MOUSE 


PE21 RAT 


VIRR AGRT6 (FP) 




V2R HUMAN 


LSHR BOVIN 


PE21 MOUSE 


PLSC COCNU 




V2R BOVIN 


GLR RAT 


PAFR HUMAN 


OPS 6 DROME 




OPS1 HEMS A 


DBDR MACMU 


OPS4 DROME 


NY5R RAT 




CML2 HUMAN 


TSHR MOUSE 


GLR HUMAN 


NY5R PIG 




CKR5 HUMAN 


TSHR HUMAN 


YMJC CAEEL 


NY5R MOUSE 
P2Y7 HUMAN


m O T TT-l r~l 7\ VTT7 1 7\ 

TSHR CANFA 


PE21 HUMAN 


JNibK. rlUMAJM 




OPSH_ASTFA 


SUR7 YEAST 


OPSD_CORAU 


NY5R CANFA 




OPSG AST FA 


OPSG GECGE 


0L1H HUMAN 


NTR2 RAT 




OPSD AST FA 


0LF2 CANFA 


YJZ3 YEAST 


NTR2 MOUSE 




OPS5 DROME 


MGR6 HUMAN 


PROA HAEIN(FP) 


MGR3 RAT 




MGR4 HUMAN 


GPRI HUMAN 


LSHR CHICK 


MGR3 HUMAN 




ACTR PAPHA 


GPRE RAT 


FSHR CHICK 


GUSB BOVIN 




OPSD LIMBE 


NY4R HUMAN 


PI2R MOUSE 






YXX5 CAEEL 


ET3R XENLA 


PF2R MOUSE 






AA1R MOUSE 


GRHR SHEEP 


P2YR RAT 







A Third Example: the helix-turn-helix DNA binding motif 

The third example that showcases the present invention corresponds to the
helix-turn-helix motif that mediates the binding of many regulatory proteins to regulatory
control sites of DNA. This 20 amino-acid long structural motif consists of two helices (7
and 9 a.a. respectively) that are separated by a 4 amino-acid turn and are held together
through non-polar interactions of their side chains. It has been argued that
sequence-based analysis using traditional approaches cannot unambiguously identify
helix-turn-helix motifs unless it is combined with the use of stereo-chemical constraints.
More recently, a pattern-based approach started with 91 carefully selected, aligned
sequence fragments that corresponded to known helix-turn-helix instances and produced
significant results by essentially estimating a pattern-based profile for the helix-turn-helix
binding motif. This set of 91 fragments is particularly interesting because it is a very
diverse collection of helix-turn-helix motif instances that share very little at the sequence
level.

In the experiment carried out, a subset of 70 fragments from the set of 91
was selected (excluding those helix-turn-helix instances that corresponded to
pieces of homeoboxes) and no alignment information was assumed. Additionally, each
of the fragments was extended to the left and to the right by including an additional 10
amino acids, thus producing fragments that were 40 amino acids long. Again, the
patterns were discovered assuming the equivalence classes {A, G}, C, {D, E}, {F, Y}, H,
{I, L, M, V}, {K, R}, {N, Q}, P, {S, T}, W. The Teiresias parameters were set to L=5 and
W=10, whereas the successive threshold choices were K=70/5=14, K=3 and K=2. The aim
was to discover patterns that involve at least 5 non-wild cards in any rolling window
spanning 10 positions that begins and ends with a literal, a relatively high degree of local
similarity (i.e. 50% or higher). From the discovered set, those patterns whose estimated
log-probability was equal to -30.0 or less were selected, thus giving rise to a composite
descriptor with 517 patterns. Table 7 below lists the labels of the 70 fragments in this
training set. Table 7 shows Swiss-Prot labels of the 70 sequence fragments with length
40 a.a. in the training set for the helix-turn-helix experiment.
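
The extension of each 20-residue fragment by 10 residues on either side can be sketched as below. The function name `extend_fragment` and the clamping at the ends of the parent sequence are assumptions; the text does not say how fragments near a sequence terminus were handled.

```python
def extend_fragment(sequence, start, motif_len=20, flank=10):
    """Return the motif at `start` together with `flank` residues on each side.

    `start` is the 0-based position of the motif in the parent sequence.
    The window is clamped to the sequence boundaries (an assumption).
    """
    left = max(0, start - flank)
    right = min(len(sequence), start + motif_len + flank)
    return sequence[left:right]


# Toy parent sequence: 10 flanking residues, a 20-residue motif, then 15 more residues.
parent = "A" * 10 + "H" * 20 + "C" * 15
fragment = extend_fragment(parent, start=10)
```

When enough residues are available on both sides, the result is 40 residues long, matching the fragment length used in the experiment.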

Table 7 

| 1 through 20 | 21 through 40 | 41 through 60 | 61 through 70 |
| BIRA_ECOLI | TNPO_ECOLI | TNP3_ECOLI | RCRO_BPP22 |
| CYTR_ECOLI | DNIV_BPP1 | DNIV_ECOLI | VG30_BPPH8 |
| RBTR_KLEAE | VPB_BPMU | DNIV_SALTY | RPC_BPPH1 |
| ASNC_ECOLI | LACI_ECOLI | RCRO_LAMBD | DBNE_BPMU |
| CRP_ECOLI | PURR_ECOLI | RPC2_LAMBD | DBNE_BPD10 |
| ARAC_ERWCH | DEOR_ECOLI | RCRO_BP434 | RP32_ECOLI |
| ADA_ECOLI | ARAC_ECOLI | RPC1_BPP22 | RPSF_BACSU |
| DICC_ECOLI | FNR_ECOLI | RPC1_BPPH8 | RPSE_BACSU |
| LYSR_ECOLI | DICA_ECOLI | RPC_BP163 | RP54_KLEPN |
| ILVY_ECOLI | FIS_ECOLI | RPC_BPP2 | RP54_AZOVI |
| TRPI_PSEAE | METR_SALTY | VPC_BPMU | |
| NOD2_RHIME | AMPR_ENTCL | RPSD_BUCAP | |
| XYLR_BACSU | NOD1_RHIME | RPSA_BACSU | |
| NIFA_RHIME | XYLS_PSEPU | RPSB_BACSU | |
| NTRC_RHIME | NIFA_KLEPN | RP54_RHIME | |
| MERR_STAAU | NTRC_KLEPN | PARB_ECOLI | |
| NAHR_PSEPU | MERR_BACSR | SOPB_ECOLI | |
| TER2_ECOLI | MERR_PSEAE | RPC1_LAMBD | |
| TNP2_ECOLI | TER3_ECOLI | RPC1_BP434 | |
| TNP1_ECOLI | TNP5_PSEAE | RPC2_BPP22 | |





The resulting DFA (deterministic finite automaton, which will only
recognize instances of the composite descriptor patterns in a query sequence and which
performs method step 260) was used to search the randomized version RAND-Swiss-Prot
of Swiss-Prot (Release 38.0), and a total of 277 randomized
sequences received non-zero support. Of the 277 randomized sequences, 275
received a support value that was less than or equal to 6. Thus, Thres_rand was set equal to
7. This threshold choice corresponded to the 99.2-th percentile. Fig. 4 shows the
histogram of the scores for the sequences of RAND-Swiss-Prot that received non-zero
support.

Subsequent search of the actual Swiss-Prot database gave rise to 193
sequences that received support greater than or equal to Thres_rand=7. The support values
ranged from the minimum allowed value of 7 to a maximum value of 66.

Next, the Swiss-Prot annotation (feature table "FT" lines and description
"DE" lines) was used for each of these 193 sequences. Of these, 169 are actually listed in
Swiss-Prot as containing a helix-turn-helix motif, 2 are listed as belonging to an H-T-H
group from PFAM (Y4WC_RHISN, Y4AM_RHISN) and 3 are listed as having
DNA-binding properties (VR2B_BPT4), being putative DNA replication proteins
(Y4CK_RHISN), or being a cytosine-specific methyltransferase (MTE8_ECOLI). Of the
remaining proteins, 1 is listed as a hypothetical protein (YP60_METTM), 1 is listed as a
hypothetical transcription factor containing a helix-turn-helix motif (Y558_METJA), 1 is
listed as being involved in DNA packaging (XTMA_BACSU), 1 is listed as having
strong similarity to MJ1545, which is a putative transcription repressor protein containing
a helix-turn-helix motif (Y014_ARCFU), 3 have very good blastp P-values with all the
similarities confined to the helix-turn-helix region of the input fragments
(PRPD_SALTY, PRPD_ECOLI, Y0FO_MYCTU), and finally, 2 are likely to be false
positives (YOAE_ECOLI, CTPE_MYCTU). Table 8 below contains a listing of the
labels of these 193 hits in order of decreasing value of accumulated support. Table 8
shows the Swiss-Prot labels of the 193 sequence fragments that are discovered using the
composite descriptor derived from the original set of 70 fragments.



Table 8 

| 1 through 50 | 51 through 100 | 101 through 150 | 151 through 193 |



RPSF BACSU 


RP54 CAUCR 


RPSD SERMA 


NIFA KLEOX 


RPSE BACSU 


RBSR ECOLI 


RPSD SALTY 


MERR_BACSR 


RPSF BACLI 


PURR HAEIN 


RPSD PSEFL 


HIPB ECOLI 


RP3 5 BACTK 


FIS HAEIN 


RPSD PSEAE 


FIXK BRAJA 


RPSE CLOAB 


TNP2 ECOLI 


RPSD ECOLI 


CTPE MYCTU 


RPSF BACME 


RPCl BPP22 


RPSD BUCAP 


YCIT ECOLI 


RPSG CLOAB 


RP54 AZOCA 


RPCl BPD3 


RPSD NEIGO 
RPSB BACSU


RP3 2 PROMI 


RP5 5 BRAJA 


RPOD STRPN 


RPC1 LAMBD 


RP32 ENTCL 


RP54 RHISN 


RPC BPPH1 


RPSG BACSU 


RP32 ECOLI 


RP54 RHILP 


REGL STRLI 


TER3 ECOLI 


RP32 CITFR 


RP32 SERMA 


PRPD SALTY 


VG3 0 BPPH8 


NTRC BRASR 


PURR ECOLI 


PRPD ECOLI 


LACI ECOLI 


NTRC AZOCA 


NTRC RHOCA 


NODD RHILE 


TER1 ECOLI 


NODI RHIGA 


MALI ECOLI 


NOD3 RHIME 


RPC BP163 


NAHR PSEPU 


EBGR ECOLI 


NOD2 RHISN 


RBTR KLEAE 


GALR SALTY 


CSCR ECOLI 


NOD2 BRAJA 


YD2 8 METTH 


GALR ECOLI 


CRP HAEIN 


MALR STRCO 


RDGA ERWCA 


FIS ECOLI 


Y014 ARCFU 


HRDA STRCO 


RPC2 BPP22 


FADR HAEIN 


XTMA BACSU 


HMOS CAE EL 


ut.yx ACTPL 


FX24 RHILV 


SCRR PEDPE 


YOAE ECOLI 


FNR SALTY 


ENDR PAEPO 


RPC2 LAMBD 


Y4CK RHISN 


FNR HAEIN 


YCJW ECOLI 


RPC2 BP434 


YOFO MYCTU 




TNP7 ECOLI 


RP54 SALTY 


TYRR HAEIN 


F.TRA SHEPU 


TNP5 PSEAE 


RP54 KLEPN 


TYRR ECOLI 


RPSD HAEIN 


NTRC AZOBR 


RP54 ECOLI 


TRA6 PSEAE 


RP C S4 PSEPU 


MTE8 ECOLI 


RP54 BRAJA 


RPSD PSEPU 


Dpci PQFAE 


CRP SALTY 


N0D2 BRAEL 


RPSD CAUCR 




CRP ECOLI 


NODI RHISN 


RPSD BACSU 


TFR9 F.POLI 


ASCG ECOLI 


NODI BRASN 


RP54 THIFE 


I\r OL* O Innu 


ADA ECOLI 


NODI BRAJA 


N0D2 RHILP 




RP54 ALCEU 


MALR STRPN 


NIFA RHOCA 


Dpqn ENTFA 


GALS ECOLI 


MALI VIBFU 


NIFA ENTAG 




SCRR VIBAL 


GNTR ECOLI 


NIFA AZOVI 




RP55 RHIME 


DBNE BPD10 


NIFA AZOCH 


RTRA 9ALTY 


RP54 RHIME 


Y55 8 METJA 


MERR STAAU 


RTRA ECOLI 


RP2 8 BACTK 


Y4WC RHISN 


ILVY SALTY 


YP6 0 METTM 


REGA CLOAB 


Y4AM RHISN 


ILVY ECOLI 


PARB ECOLI 


NODD BRASP 


Y272 METJA 


FECI ECOLI 


NTRC RHIME 


CCPA STRMU 


TRPI PSESY 


CYTR ECOLI 


NOD 2 RHIME 


ASNC ECOLI 


TRPI PSEAE 


BTR BORPE 


NODI RHIME 


SCRR STAXY 


RPSD STRAU 


ARAC ERWCH 


TER8 PASMU 


RPSK BACSU 


RP54 VIBAN 


AMPR ENTCL 


RPSD LISMO 


RPSD CLOAB 


RP54 ACICA 


AMPR CITFR 


RCRO BPP22 


RBSR BACSU 


RCRO LAMBD 




FNRL RHOSH 


KDGR BACSU 


RCRO BP434 




FIXK RHIME 


DEGA BACSU 


RAFR ECOLI 




FIXK AZOCA 


ASNC HAEIN 


NODD RHILV 




TER8 PASPI 


VR2B BPT4 


NODD RHILT 




TER4 ECOLI 


VPB BPMU 


NODI BRAEL 




RPC1 BP434 


SCRR STRMU 


NIFA KLEPN 





Starting now with the set of all 193 discovered sequence fragments, one
more iteration of the described method was carried out using this set as the new training
set, T. The training set for this iteration was formed by collecting the individual sequence
fragments whose support exceeded threshold. As before, the Teiresias parameters were
set to L=5 and W=10, whereas the successive threshold choices were K=193/5=38, K=7
and K=2. Sub-selecting those patterns whose estimated log-probability was equal to
-30.0 or less produced 1,061 patterns, which were added to the previous set of 517 to form
a new augmented composite descriptor. The DFA resulting from the latter descriptor was
applied to RAND-Swiss-Prot. Of the 537 sequence fragments that received non-zero
support, 534 received support 9 or less, thus establishing the value 10 as the new Thres_rand
(=99.2-th percentile). Processing Swiss-Prot with this last DFA, an additional 96
sequence fragments were discovered that exceeded threshold, for a grand total of 289
fragments. Table 9 lists the Swiss-Prot labels of these additional 96 sequence fragments,
which were discovered after augmenting the original composite descriptor with the
patterns discovered by treating the first set of 193 discovered fragments as a training set.
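The iteration described above (successive support thresholds, followed by sub-selection on estimated log-probability) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported experiments: `discover` is a hypothetical stand-in for a Teiresias run at fixed L and W that returns each pattern together with its estimated log-probability, and the sketch omits any masking of already-discovered patterns between rounds.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical signature (an assumption of this sketch): discover(training_set, k)
# yields (pattern, estimated_log_probability) pairs for patterns that appear in
# at least k sequences of the training set.
Discover = Callable[[List[str], int], Iterable[Tuple[str, float]]]


def support_schedule(n_sequences: int, later: Tuple[int, ...] = (7, 2)) -> List[int]:
    """First threshold is one fifth of the training set; the rest are fixed values."""
    return [n_sequences // 5, *later]


def augment_descriptor(
    descriptor: Iterable[str],
    training_set: List[str],
    discover: Discover,
    logp_cutoff: float = -30.0,
) -> Set[str]:
    """Add every discovered pattern whose log-probability is at or below the cutoff."""
    augmented = set(descriptor)
    for k in support_schedule(len(training_set)):
        for pattern, logp in discover(training_set, k):
            if logp <= logp_cutoff:  # "equal to -30.0 or less"
                augmented.add(pattern)
    return augmented
```

For a training set of 193 fragments, `support_schedule` yields the thresholds 38, 7 and 2 quoted in the text. Using a set for the descriptor is a design choice of the sketch; it simply avoids storing the same pattern twice.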



Table 9

1 through 25 | 26 through 50 | 51 through 75 | 76 through 96
HRDB STRCO | CCPA BACSU | FNRA PSEST | VMEM PVSP
EMRD ECOLI | CCPA BACME | ANR PSEAE | V57A BPT4
HRDD STRCO | YJGS ECOLI | YH93 ARCFU | SP3D BACSU
RPSD LACLA | RP32 VIBCH | RPOS VIBCH | RPC BPP2
RPSD SYNP7 | RBSR HAEIN | RPOS PSEAE | MALR STAXY
RPSD MICAE | YFED ECOLI | YFER ECOLI | EBSC ENTFA
RPSD ANASP | RPSD RICPR | Y701 SYNY3 | VG36 BPML5
RPSD AGRTU | RPSD BORBU | Y4BA RHISN | VG36 BPMD2
Y01W MYCTU | RPSD HELPY | RP32 PSEAE | PRPR SALTY
RPSD CHLTR | YYAA BACSU | FRVR ECOLI | MERB SERMA
RPSD MYXXA | RP54 XANCV | ARAC SALTY | BRPA STRHY
RPSD TREPA | SACR LACLA | YG27 ARCFU | ARAC ECOLI
RPSD RHOCA | NIFA RHISN | XYLR BACSU | ARAC CITFR
YVDE BACSU | NIFA RHIET | RPSC ANASP | ACOR ALCEU
RPSW STRCO | NIFA BRAJA | NADR KLEPN | YYAG BACSU
Y151 METJA | NFXB PSEAE | YRDX RHOSH | YSCC YEREN
RPOS YEREN | TRA6 BACST | YAHB ECOLI | XYS4 PSEPU
RPOS SHIFL | RP54 BACSU | TRA4 BACFR | XYS1 PSEPU
RPOS SALTY | ACRR ECOLI | RPSC SYNY3 | XYLS PSEPU
RPOS SALTI | YFET ECOLI | RP32 CAUCR | THCR RHOSN
RPOS SALDU | RP54 TREPA | NIFA AZOLI | TETP CLOPE
RPOS ECOLI | EXPR ERWCH | NIFA AZOBR |
PEPR LACDL | ECHR ERWCH | MLTD ECOLI |
GALR HAEIN | SORC KLEPN | AADR RHOPA |
CCPA STAXY | RP54 RHOCA | YDT6 SCHPO |

An analysis of the additional hits using the feature tables in Swiss-Prot
showed that 81 of those are true positives, 4 are listed as DNA binding
(TRA4_BACFR, V57A_BPT4, NADR_KLEPN) or transcription regulation proteins
(EBSC_ENTFA), and 2 are listed as hypothetical proteins (YFED_ECOLI,
YDT6_SCHPO). Finally, 8 hits probably correspond to false positives (Y4BA_RHISN,
EMRD_ECOLI, VG36_BPMD2, VG36_BPML5, TETP_CLOPE, YSCC_YEREN,
MERB_SERMA, MLTD_ECOLI).

A Fourth Example: Searching The C. elegans genome for EF1G, GPCR and HTH 
Candidates 

The three composite descriptors were used to search the collection of
19,099 ORFs that were reported for the C. elegans genome (see:
http://genome.wustl.edu/gsc/C_elegans) as of June 13, 1999. In all three cases, the
corresponding values of Thres_rand that were established by searching RAND-Swiss-Prot
were used.

Elongation Factor 1 Gamma Chain

First, this ORF collection was searched using the 2,260-pattern composite
descriptor that was built for the elongation factor gamma chain (PS50040 above). Of the
13 ORFs that received non-zero support, only one, F17C11.9, exceeded threshold. This
ORF is the one listed in Swiss-Prot (and in PS50040) as EF1G_CAEEL.
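The search step just described, in which each ORF is scored by how many descriptor patterns it contains and is reported only if that support reaches the threshold, can be sketched as below. This is an illustrative simplification, not the DFA-based implementation referred to in the text: it assumes patterns use literal residues, "." for a don't-care position and bracketed groups such as [ST] (all of which are already valid regular expressions), it counts support per whole ORF rather than per region, and it treats support equal to the threshold as a hit. The function name and the toy inputs are hypothetical.

```python
import re
from typing import Dict, Iterable


def scan_orfs(orfs: Dict[str, str], descriptor: Iterable[str], threshold: int) -> Dict[str, int]:
    """Return {ORF label: support} for ORFs whose support reaches the threshold.

    Support is taken here to be the number of distinct descriptor patterns that
    occur at least once in the ORF (an assumption of this sketch).
    """
    compiled = [re.compile(p) for p in descriptor]
    hits: Dict[str, int] = {}
    for label, sequence in orfs.items():
        support = sum(1 for rx in compiled if rx.search(sequence))
        if support >= threshold:
            hits[label] = support
    return hits


# Toy usage with made-up sequences and patterns.
toy_orfs = {"orf1": "MKAGCST", "orf2": "MMMM"}
toy_descriptor = ["A.C", "K.G", "G[CT]S"]
print(scan_orfs(toy_orfs, toy_descriptor, threshold=2))
```

With a real descriptor of a few thousand patterns, compiling each pattern once, as done here, keeps the scan linear in the number of ORFs times the number of patterns; a single automaton over all patterns would be the faster choice.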

G-protein Coupled Receptors 

Next, the C. elegans genome was searched using the composite descriptor
for the G-protein coupled receptor family, which comprised 1,703 patterns. Note that this
particular experiment did not set out to discover and enumerate all putative G-protein
coupled receptors in C. elegans, but rather to show that, even when starting with a small
knowledge base that contains no GPCR sequences from the genome under consideration,
the method can effectively mine a complete genome such as C. elegans.

Table 10 below shows the labels of the 101 C. elegans ORFs that were
discovered using the composite descriptor for the GPCR family and whose support
exceeded threshold. For each of those ORFs, the Score and the P and N values are shown
for the top-scoring sequence obtained from running a BLASTP search against the set of
804 Swiss-Prot Rel. 35.0 sequences that are known to be true GPCRs (see also
discussion above).

Table 10

# | C. elegans ORF Label | Top Scoring Training Set Seq. | Score | P | N
1 | M03F4.3 | 5H1A MOUSE | 190 | 2.300E-73 | 6
2 | K09G1.4 | D3DR RAT | 272 | 4.100E-79 | 6
3 | K02F2.6 | OAR DROME | 235 | 1.200E-59 | 5
4 | F14D12.6 | 5H1A MOUSE | 214 | 7.300E-77 | 6
5 | C09B7.1 | 5H1A MOUSE | 265 | 1.000E-61 | 4
6 | C02D4.2 | OAR DROME | 292 | 3.100E-11 | 5
7 | ZK455.3 | GRPR MOUSE | 181 | 6.600E-38 | 5
8 | C52B11.3 | 5H1A MOUSE | 232 | 6.200E-64 | 5
9 | F15A8.5 | DOP1 DROME | 300 | 6.400E-85 | 3
10 | F16D3.7 | 5H1A MOUSE | 202 | 5.400E-44 | 4
11 | F59C12.2 | 5H2A CRIGR | 221 | 5.600E-65 | 4
12 | T14E8.3 | D3DR RAT | 231 | 4.400E-51 | 4
13 | T02E9.3 | D3DR RAT | 190 | 1.600E-43 | 5
14 | F01E11.5 | OAR DROME | 293 | 1.400E-77 | 3
15 | C53C7.1 | NY4R MOUSE | 177 | 2.800E-36 | 4
16 | C30F12.6 | SSR4 RAT | 119 | 9.600E-30 | 4
17 | Y40H4A.a | ACM3 PIG | 485 | 5.100E-83 | 2
18 | F41E7.3 | NK2R RAT | 148 | 3.600E-39 | 5
19 | C38C10.1 | NK1R RANCA | 232 | 1.800E-59 | 4
20 | ZC412.1 | NY4R MOUSE | 176 | 2.500E-30 | 4
21 | C26F1.6 | OPSB ANOCA | 71 | 5.200E-10 | 3
22 | C39B10.1 | AG2R HUMAN | 68 | 1.800E-04 | 2
23 | C24A8.1 | D3DR RAT | 147 | 8.500E-40 | 5
24 | T07D4.1 | NK1R RANCA | 114 | 2.900E-27 | 5
25 | F55E10.7 | OPRX PIG | 113 | 1.400E-15 | 3
26 | C16D6.2 | NY4R MOUSE | 180 | 6.400E-32 | 4
27 | C10C6.2 | NY4R MOUSE | 170 | 4.500E-31 | 3
28 | C15B12.5 | ACM1 RAT | 170 | 7.700E-50 | 7
29 | F47D12.2 | ACM3 CHICK | 178 | 2.000E-49 | 4
30 | T23C6.5 | GRPR MOUSE | 102 | 1.600E-25 | 5
31 | W05B5.2 | GRPR MOUSE | 143 | 2.400E-31 | 7
32 | T27D1.3 | NK1R RAT | 90 | 3.000E-25 | 5
33 | C49A9.7 | NK1R RAT | 318 | 1.400E-65 | 2
34 | AH9.1 | OPSB GECGE | 62 | 6.600E-07 | 3
35 | B0563.6 | ACM1 RAT | 94 | 3.500E-09 | 3
36 | R106.2 | SSR4 RAT | 126 | 2.200E-43 | 5
37 | M01E10.1 | IL8A RAT | 134 | 2.100E-14 | 1
38 | T07D10.2 | V1BR HUMAN | 140 | 2.300E-33 | 6
39 | F54D7.3 | GRHR HUMAN | 205 | 3.300E-42 | 4
40 | C50F7.1 | NK1R RAT | 183 | 2.000E-35 | 5
41 | R13H7.2 | 5H1A MOUSE | 67 | 1.600E-04 | 2
42 | Y54E2A.1 | SSR4 RAT | 118 | 1.500E-39 | 5
43 | T07F8.2 | OPRX PIG | 74 | 4.000E-11 | 4
44 | [illegible] | NY4R MOUSE | 232 | 9.400E-43 | 4
45 | [illegible] | SSR4 RAT | 97 | 6.400E-30 | 4
46 | T05A1.1 | NY4R MOUSE | 218 | 1.700E-30 | 2
47 | [illegible] | GPRO HUMAN | 106 | 1.600E-23 | 7
48 | F42C5.2 | SSR4 RAT | 109 | 1.500E-23 | 5
49 | [illegible] | NY4R MOUSE | 136 | 5.800E-32 | 4
50 | [illegible] | OPRX PIG | 51 | 5.900E-07 | 5
51 | F47D12.1 | ACM3 CHICK | 104 | 1.100E-15 | 3
52 | [illegible] | AG2R HUMAN | 188 | 1.400E-26 | 3
53 | T23B3.4 | 5H1A MOUSE | 84 | 1.800E-24 | 5
54 | AC7.1 | NK1R RAT | 195 | 4.400E-46 | 3
55 | C51E3.1 | OLF5 RAT | 83 | 1.800E-08 | 3
56 | C25G6.5 | NY4R MOUSE | 208 | 4.600E-42 | 4
57 | [illegible] | PAFR CAVPO | 51 | 6.100E-02 | 2
58 | C24A8.4 | 5HTB DROME | 102 | 1.500E-14 | 2
59 | [illegible] | NK2R RAT | 84 | 1.000E-07 | 4
60 | F14F4.1 | V1BR HUMAN | 109 | 2.100E-25 | 5
61 | [illegible] | GPRO HUMAN | 103 | 9.800E-18 | 4
62 | [illegible] | OPSD CATBO | 80 | 4.200E-07 | 2
63 | [illegible] | AA3R HUMAN | 60 | 9.200E-04 | 1
64 | H07I12.3 | AA1R CHICK | 86 | 6.300E-13 | 4
65 | [illegible] | SSR4 RAT | 119 | 4.500E-35 | 5
66 | [illegible] | SSR4 RAT | 113 | 7.100E-23 | 5
67 | [illegible] | GRHR HUMAN | 38 | 1.800E-01 | 3
68 | [illegible] | [illegible] | 51 | 4.900E-03 | 2
69 | [illegible] | [illegible] | 210 | 4.500E-41 | 4
70 | [illegible] | [illegible] | 57 | 3.000E-07 | 4
71 | F21C10.9 | SSR4 RAT | 71 | 2.500E-10 | 4
72 | F59A1.12 | CRFR CHICK | 42 | 3.600E-01 | 3
73 | F40A3.7 | AA1R CAVPO | 47 | 1.300E-03 | 3
74 | T19F4.1 | ML1C CHICK | 62 | 6.600E-09 | 4
75 | F59B2.13 | SSR4 RAT | 71 | 1.900E-05 | 3
76 | F54E4.1 | CASR HUMAN | 51 | 9.000E-01 | 1
77 | F54D1.5 | CASR HUMAN | 49 | 8.200E-01 | 1
78 | C54A12.2 | OPSB ANOCA | 58 | 7.400E-08 | 3
79 | C53A5.12 | ACM3 RAT | 275 | 1.000E-33 | 1
80 | Y58G8A.208.a | NY4R MOUSE | 186 | 4.600E-38 | 4
81 | Y105C5.v | EBI2 HUMAN | 129 | 9.700E-17 | 2
82 | [illegible] | NK1R RAT | 53 | 1.200E-01 | 2
83 | T22G5.4 | GRPR MOUSE | 70 | 4.300E-09 | 3
84 | F31B9.1 | NK1R RANCA | 106 | 1.000E-27 | 5
85 | [illegible] | ACM2 HUMAN | 46 | 1.400E-01 | 3
86 | H09F14.1 | SSR4 RAT | 73 | 2.400E-12 | 4
87 | C45H4.3 | AG2R HUMAN | 46 | 1.000E-01 | 2
88 | Y71 HI 9A 199 | VIPR MELGA | 51 | 4.100E-02 | 2
89 | T23H2.3 | DOP1 DROME | 55 | 2.200E-01 | 1
90 | F59A7.8 | NK1R RANCA | 52 | 2.300E-01 | 1
91 | [illegible] | MC4R RAT | 67 | 6.800E-07 | 4
92 | F21G4.2 | OAR DROME | 53 | 3.400E-01 | 2
93 | C15H11.2 | V1BR HUMAN | 89 | 5.900E-20 | 5
94 | C10F3.3 | OL1J HUMAN | 46 | 1.200E-02 | 4
95 | C06B3.11 | B3AR MOUSE | 53 | 1.700E-04 | 3
96 | Y77E11A.3443 | MSHR HUMAN | 59 | 1.000E-02 | 1
97 | K09C6.5 | MC4R RAT | 41 | 5.200E-02 | 3
98 | Y41D4B.3805. | GRHR HUMAN | 70 | 1.600E-04 | 1
99 | Y40H7.d | OAR DROME | 37 | 9.700E-01 | 1
100 | Y116F11.zz8 | SSR4 RAT | 53 | 4.800E-02 | 2
101 | T26E4.15 | AG2R HUMAN | 79 | 2.400E-09 | 3

In addition to the above 101 C. elegans ORFs that exceeded threshold, and
as testimony to the stringent thresholds used, Table 11 below lists an
additional 19 ORFs whose scores were just below threshold and which generated
blast-search P values that were significant. As before, the blast searches were carried out
against the set of 804 known true GPCRs from Rel. 35.0 of Swiss-Prot.



Table 11

# | C. elegans ORF Label | Top Scoring Training Set Seq. | Score | P | N
102 | T26E4.14 | AG2R HUMAN | 73 | 8.10E-08 | 2
103 | M01B2.7 | NK1R RANCA | 60 | 4.10E-09 | 4
104 | K03H6.5 | SSR4 RAT | 52 | 6.50E-05 | 4
105 | F58D7.1 | SSR4 RAT | 55 | 9.50E-06 | 4
106 | F57H12.4 | D3DR RAT | 80 | 9.40E-10 | 3
107 | F53A9.5 | AA1R CAVPO | 78 | 5.50E-06 | 1
109 | F02E8.2 | SSR4 RAT | 90 | 6.50E-21 | 5
110 | C51E3.4 | OPSD CATBO | 83 | 3.10E-06 | 1
111 | C02H7.2 | 5H2A CRIGR | 61 | 8.00E-09 | 4
113 | Y34D9A.152.d | NY4R MOUSE | 56 | 2.10E-08 | 4
114 | K06B4.9 | ML1X HUMAN | 73 | 2.60E-06 | 2
118 | F37E3.2 | GPCR LYMST | 127 | 2.70E-11 | 3
119 | F35F10.2 | PAFR CAVPO | 51 | 8.50E-04 | 3
124 | C06G4.5 | OPRX PIG | 164 | 3.50E-23 | 4
128 | Y57A10C.8 | OLF9 RAT | 65 | 2.80E-04 | 2
132 | Y4C6A.h | MGR8 HUMAN | 347 | 2.30E-215 | 1
134 | R07B5.5 | A2AA PIG | 53 | 1.40E-04 | 3
135 | M01B2.9 | AG2R HUMAN | 73 | 1.50E-07 | 2
140 | C51E3.3 | AA1R CHICK | 73 | 5.70E-05 | 1

Several comments are in order here. First, it should be stressed that the
above analysis does not imply that there are only 120 G-protein coupled receptors in C.
elegans. Instead, the aim is to demonstrate that even if one begins with a
small knowledge base of only 80 known GPCRs that have been selected randomly, one
can still build a useful composite descriptor for the family and use it to explore a
largely unexplored genome such as C. elegans. In order to obtain a complete enumeration
of the GPCRs that are present in C. elegans, the composite descriptor should be built
using all of the GPCRs that are present in GPCRDB and not only 80 of them. Second,
the BLAST searches were run against the set of 804 sequences in order to show
the ability of the proposed method to extrapolate. As such, blast-search results with P
values that are relatively high (e.g., E-02) should not be surprising, since the target
database of 804 true GPCRs is but a small fraction of the current contents of GPCRDB.
Indeed, the November 1999 release of GPCRDB contained 1,704 GPCR sequences and
431 GPCR sequence fragments, for a grand total of 2,135 entries.

Helix-Turn-Helix 

Finally, the 19,099 ORFs of C. elegans were searched for instances of the
helix-turn-helix binding motif using the corresponding 2,288 (=1,896+392) pattern
composite descriptor. Of the 169 sequences that received non-zero support, only 5
exceeded threshold: Y94H6A_142.g (in the region delineated by a.a. 65 through 95),
C16C2.1 (in the region delineated by a.a. 59 through 89), F18C5.2 (in the region
delineated by a.a. 850 through 880), Y39F10A.a (in the region delineated by a.a. 125
through 155), and Y48C3A.s (in the region delineated by a.a. 113 through 143).
The fragments were:

>Y94H6A_142.g fragment

IFDNTNDLVASLLGISSITVYRKRKRIGEE

>C16C2.1 fragment

YLSGSTRAKLAESLGLSDNQVKVWFQNRRT

>F18C5.2 fragment

ISRSTAKEVATARGISEGTVYSYLAMAVEK

>Y39F10A.a fragment

LSAYTISDLAKHFNVSKIEILKIDIEGAEL

>Y48C3A.s fragment

NEVLNLNEVAKELNISKRRVYDVINVLEGL



and their respective top-scoring sequences from the training set of 70 helix-turn-helix
segments, blast scores, and P and N values are:

# | C. elegans ORF | Top Scoring | Score | P | N
1 | Y94H6A_142.g | RPSF BACSU | 50 | 2.80E-06 |
2 | C16C2.1 | TER3 ECOLI | 45 | 1.30E-05 |
3 | F18C5.2 | VBP BPMU | 47 | 9.30E-06 |
4 | Y39F10A.a | TNP0 ECOLI | 39 | 1.10E-04 |
5 | Y48C3A.s | TNP1 ECOLI | 49 | 6.40E-06 |



It is to be understood that the embodiments and variations shown and 
described herein are merely illustrative of the principles of this invention and that various 
modifications may be implemented by those skilled in the art without departing from the 
scope and spirit of the invention. 
What is claimed is:



Claims 



1. A method comprising the steps of:

providing a set of sequences, wherein the sequences are not aligned;

discovering a plurality of patterns common to a plurality of the sequences; and

determining if a candidate sequence comprises a predetermined number of
the patterns.

2. The method of claim 1, wherein the patterns common to a plurality of the
set of sequences comprise test patterns, wherein the sequences in the set of sequences
comprise test sequences, and wherein the step of determining if a candidate sequence
comprises a predetermined number of the patterns comprises the step of determining if
there are candidate patterns in the candidate sequence that match all of the predetermined
number of test patterns.

3. The method of claim 1, further comprising the step of determining if each
of the plurality of patterns is statistically significant.

4. The method of claim 1, wherein the step of discovering is performed
without any knowledge about properties or features of sequences in the set of unaligned
sequences.

5. The method of claim 1, further comprising the steps of, if the candidate
sequence comprises the predetermined number of patterns, adding the candidate sequence
to the set of sequences to create a new set of sequences and performing the step of
discovering on the new set of sequences.

6. The method of claim 1, wherein each sequence comprises a series of
symbols and wherein each pattern comprises a plurality of positions, some of the
positions each comprising at least one expected symbol and other of the positions
comprising "don't care" positions.

7. The method of claim 6, wherein, for one of the positions, the at least one
expected symbol is a plurality of expected symbols.
8. The method of claim 3, wherein the step of determining if each of the
plurality of patterns is statistically significant comprises the steps of selecting one of the
patterns, determining if a probability that the selected pattern occurs in a sequence meets
a predetermined threshold, and continuing to select additional patterns until each pattern
has been selected.

9. The method of claim 8, wherein the step of determining if a probability
that the selected pattern occurs in a sequence meets a predetermined threshold further
comprises the steps of using a second-order Markov chain method to determine the
probability that the selected pattern occurs in a sequence and determining a natural
logarithm of the probability that the selected pattern occurs in a sequence.

10. The method of claim 3, wherein the step of determining if each of the
plurality of patterns is statistically significant further comprises the steps of removing
instances of each of the patterns from the set of sequences to create a new set of
sequences and performing the step of discovering on the new set of sequences.

11. The method of claim 3, wherein the step of determining if each of the
plurality of patterns is statistically significant further comprises the steps of, if any of the
patterns is statistically significant, selecting a statistically significant pattern, modifying a
composite descriptor to include the selected pattern if the selected pattern is not already
part of the composite descriptor, and continuing to select statistically significant patterns
until all statistically significant patterns have been selected.

12. The method of claim 1, wherein the step of discovering a plurality of
patterns common to a plurality of the sequences comprises the steps of:

selecting a predetermined threshold that indicates how many of the
sequences should contain a pattern for the pattern to be considered common;

discovering patterns, if any, that are common to the predetermined
threshold of sequences;

if there are no patterns common to the predetermined threshold of
sequences, decreasing the predetermined threshold; and

performing, until the predetermined threshold is less than a predetermined
amount, the step of discovering patterns, if any, that are common to the predetermined
threshold of sequences and the step of, if there are no patterns common to the
predetermined threshold of sequences, decreasing the predetermined threshold.
13. A method for unsupervised building and exploitation of composite
descriptors, the method comprising the steps of:

i. providing a training set of sequences, each sequence
comprising a plurality of symbols;

ii. determining a set of maximal patterns, each of the maximal
patterns being common to a predetermined number of the sequences,
wherein the step of determining a set of maximal patterns is performed
without any knowledge about properties or features of sequences in the set
of unaligned sequences;

iii. determining which, if any, of the maximal patterns are
statistically significant; and

iv. creating a composite descriptor from the statistically
significant maximal patterns.

14. The method of claim 13, wherein the sequences in the training set are
unaligned.

15. The method of claim 13, wherein the step of creating a composite
descriptor from the statistically significant maximal patterns further comprises the steps
of determining which of the statistically significant maximal patterns are currently not part
of the composite descriptor, adding those statistically significant maximal patterns that
are currently not part of the composite descriptor to the composite descriptor, and
removing the added statistically significant maximal patterns from the training set of
sequences.

16. The method of claim 15, wherein each symbol comes from an alphabet
that describes DNA (deoxyribonucleic acid) or proteins.

17. The method of claim 13, wherein the symbols are numerical

18. The method of claim 15, further comprising the steps of iterating steps (ii)
through (iv) until either the training set contains no sequences or there are no statistically
significant maximal patterns common to the sequences in the training set.

19. The method of claim 15, further comprising the step of determining if a
candidate sequence comprises a predetermined number of the statistically significant
maximal patterns.

20. The method of claim 19, comprising the steps of, if the candidate sequence
comprises the predetermined number of the statistically significant maximal patterns,
adding the candidate sequence to the set of sequences to create a new training set of
sequences and performing the steps (ii) through (iv) on the new training set of sequences.

21. The method of claim 13, wherein the step of determining which, if any, of
the maximal patterns are statistically significant comprises the step of determining for
each of the maximal patterns if a probability that this maximal pattern occurs in a
sequence meets a predetermined threshold.

22. The method of claim 13, wherein the set of maximal patterns is empty and
wherein the step of determining a set of maximal patterns further comprises the steps of
reducing the predetermined number of sequences and performing step (ii) again.
23. A system comprising:

a memory that stores computer-readable code; and

a processor operatively coupled to said memory, said processor configured
to implement said computer-readable code, said computer-readable code configured to:

provide a set of sequences, wherein the sequences are not aligned;

discover a plurality of patterns common to a plurality of the sequences; and

determine if a candidate sequence comprises a predetermined number of
the patterns.

24. A system for unsupervised building and exploitation of composite
descriptors, comprising:

a memory that stores computer-readable code; and

a processor operatively coupled to said memory, said processor configured
to implement said computer-readable code, said computer-readable code configured to:

i. provide a training set of sequences, each sequence
comprising a plurality of alphabetic symbols;

ii. determine a set of maximal patterns, each of the maximal
patterns being common to a predetermined number of the sequences,
wherein the maximal patterns are determined without any knowledge
about properties or features of sequences in the set of unaligned sequences;

iii. determine which, if any, of the maximal patterns are
statistically significant; and

iv. create a composite descriptor from the statistically
significant maximal patterns.
25. An article of manufacture comprising:

a computer readable medium having computer readable code means
embodied thereon, said computer readable program code means comprising:

a step to provide a set of sequences, wherein the sequences are not aligned;

a step to discover a plurality of patterns common to a plurality of the
sequences; and

a step to determine if a candidate sequence comprises a predetermined
number of the patterns.

26. An article of manufacture for unsupervised building and exploitation of
composite descriptors, comprising:

a computer readable medium having computer readable code means
embodied thereon, said computer readable program code means comprising:

a step to provide a training set of sequences, each sequence comprising a
plurality of alphabetic symbols;

a step to determine a set of maximal patterns, each of the maximal patterns
being common to a predetermined number of the sequences, wherein the maximal
patterns are determined without any knowledge about properties or features of sequences
in the set of unaligned sequences;

a step to determine which, if any, of the maximal patterns are statistically
significant; and

a step to create a composite descriptor from the statistically significant
maximal patterns.
UNSUPERVISED BUILDING AND EXPLOITATION OF COMPOSITE DESCRIPTORS

Abstract of the Disclosure

Generally, the present invention provides a way of determining in an
unsupervised manner additional members for a family that is defined initially through
exemplar sequences. The present invention is unsupervised in that it proceeds without
any information related to the exemplar sequences defining the family, without aligning
the sequences, without prior knowledge of any patterns in the exemplar sequences, and
without knowledge of the cardinality or characteristics of any features that may be present
in the exemplar sequences. In one aspect of the invention, a method is used to take a set
of unaligned sequences and discover several or many patterns common to some or all of
the sequences. These patterns can then be used to determine if candidate sequences are
members of the family. In another aspect of the invention, a method is used to take a set
of sequences and to determine a set of maximal patterns common to a number of
sequences. The maximal patterns are determined without any previous knowledge about
any properties or features that may be present in the processed sequences.
AS A BELOW NAMED INVENTOR, I hereby declare that:

My residence, post office address and citizenship are as stated next to my name.

I believe that I am the original, first and sole (if only one name is listed below), or an original, first and joint inventor (if plural
names are listed below), of the subject matter which is claimed and for which a patent is sought on the invention entitled:
TITLE: UNSUPERVISED BUILDING AND EXPLOITATION OF COMPOSITE DESCRIPTORS

the specification of which is attached hereto or indicates an attorney docket no., or:
□ was filed in the U.S. Patent & Trademark Office on and assigned Serial No.,
□ and (if applicable) was amended on .

I hereby state that I have reviewed and understand the contents of the above-identified specification, including the claims, as
amended by any amendment referred to above. I acknowledge the duty to disclose information which is material to patentability and
to the examination of this application in accordance with Title 37, Code of Federal Regulations §1.56. I hereby claim foreign priority
benefits under Title 35, U.S. Code §119(a)-(d) or §365(b) of any foreign application(s) for patent or inventor's certificate, or §365(a)
of any PCT international application which designated at least one country other than the United States, or §119(e) of any United States
provisional application(s), listed below and have also identified below any foreign applications for patent or inventor's certificate having
a filing date before that of the application on which priority is claimed:

Priority Claimed:

(Application Number) (Country) (Day/Month/Year filed) Yes [ ] No [ ]

(Application Number) (Country) (Day/Month/Year filed) Yes [ ] No [ ]

I hereby claim the benefit under Title 35, U.S. Code §120, of any United States application(s), or §365(c), of any PCT
International application designating the United States, listed below and, insofar as the subject matter of each of the claims of this
application is not disclosed in the prior United States or PCT International application(s) in the manner provided by the first paragraph
of Title 35, U.S. Code §112, I acknowledge the duty to disclose information material to patentability as defined in Title 37, Code of
Federal Regulations §1.56 which became available between the filing date of the prior application and the national or PCT international
filing date of this application:

(Application Serial Number) (Filing Date) (STATUS: patented, pending, abandoned)

(Application Serial Number) (Filing Date) (STATUS: patented, pending, abandoned)

I hereby appoint the following attorneys: MANNY W. SCHECTER, Reg. No. 31,722; LAUREN BRUZZONE, Reg. No. 35,082; CHRISTOPHER
A. [illegible]; EDWARD A. PENNINGTON, Reg. No. 32,588; JOHN E. HOEL, Reg. No. 26,279; JOSEPH C. REDMOND Jr., Reg. No.
18,753; DOUGLAS W. CAMERON, Reg. No. 31,596; LOUIS P. HERZBERG, Reg. No. 41,500; STEPHEN C. KAUFMAN, Reg. No. 29,551; DANIEL P.
[illegible], Reg. No. 32,053; PAUL J. OTTERSTEDT, Reg. No. 37,411; LOUIS J. PERCELLO, Reg. No. 33,206; ROBERT M. TREPP, Reg. No. 25,933;
[illegible], Reg. No. 46,134; each of them of INTERNATIONAL BUSINESS MACHINES CORPORATION, Thomas J. Watson
Research Center, P.O. Box 218, Yorktown Heights, New York 10598; to prosecute this application and to transact all business in the U.S. Patent and Trademark
Office connected therewith and with any divisional, continuation, continuation-in-part, reissue or re-examination application with full power of appointment and
with full power to substitute an associate attorney or agent, and to receive all patents which may issue thereon, and request that all correspondence be addressed to:



Robert J. Mauri 

RYAN, MASON & LEWIS, LLP 
1300 Post Road, Suite 205 
Fairfield, CT 06430 
Tel.: (203)255-6560 
I HEREBY DECLARE that all statements made herein of my own knowledge are true and that all statements made on information and
belief are believed to be true; and further that these statements were made with the knowledge that willful false statements and the like
so made are punishable by fine or imprisonment, or both, under §1001 of Title 18 U.S. Code and that such willful false statements may
jeopardize the validity of the application or any patent issued thereon.

FULL NAME OF FIRST OR SOLE INVENTOR: Isidore Rigoutsos Citizenship: Greece

Inventor's signature: Date:

Residence & Post Office address: 30-30 36th Street
Astoria, NY 11103

FULL NAME OF SECOND JOINT INVENTOR: Yuan Gao Citizenship: People's Republic of China

Inventor's signature: Date:

Residence & Post Office address: 611 Half Moon Bay Dr.
Croton On Hudson, NY 10520

FULL NAME OF THIRD JOINT INVENTOR: Aristidis Floratos Citizenship: Greece

Inventor's signature: Date:

Residence & Post Office address: 31-68 35th Street
Long Island City, NY 11106